# Anna KaRNNa

In this notebook, we'll build a character-wise RNN trained on Anna Karenina, one of my all-time favorite books. It'll be able to generate new text based on the text from the book.

This network is based on Andrej Karpathy's [post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [implementation in Torch](https://github.com/karpathy/char-rnn). I also drew on material from [r2rt](http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html) and from [Sherjil Ozair](https://github.com/sherjilozair/char-rnn-tensorflow) on GitHub. Below is the general architecture of the character-wise RNN.

<img src="assets/charseq.jpeg" width="500">

In [1]:
import time
from collections import namedtuple

import numpy as np
import tensorflow as tf

First we'll load the text file and convert it into integers for our network to use. Here I'm creating a couple of dictionaries to convert the characters to and from integers. Encoding the characters as integers makes them easier to use as input in the network.

In [2]:
with open('anna.txt', 'r') as f:
    text = f.read()
vocab = set(text)
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)
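
To see the two dictionaries working together, here is a minimal round-trip on a toy string. The names `sample`, `to_int`, and `to_char` are made up for this illustration, and the vocabulary is sorted only so the mapping is the same on every run (the ordering of a plain `set` can vary between runs).

```python
# Toy round-trip: characters -> integers -> characters.
sample = "happy families"

# Sorting is an assumption made here for reproducibility.
chars = sorted(set(sample))
to_int = {c: i for i, c in enumerate(chars)}
to_char = dict(enumerate(chars))

encoded_sample = [to_int[c] for c in sample]
decoded_sample = ''.join(to_char[i] for i in encoded_sample)

print(encoded_sample)
print(decoded_sample)
```

If decoding gives back the original string, the two mappings are consistent with each other.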

In [3]:
## TESTS
#print(len(vocab))         # 83
#print(int_to_vocab)       # e.g. 0: 'h' in this run
#print(len(int_to_vocab))  # 83
#print(vocab_to_int)       # e.g. 'h': 0 in this run
#print(len(vocab_to_int))  # 83
print(vocab_to_int['C'])   # 44 in this run; set ordering can differ between runs

44


Let's check out the first 100 characters, make sure everything is peachy. According to the [American Book Review](http://americanbookreview.org/100bestlines.asp), this is the 6th best first line of a book ever.

In [4]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

And we can see the characters encoded as integers.

In [5]:
encoded[:100]

array([44,  0, 39, 18, 61, 60, 33, 26,  7,  1,  1,  1, 47, 39, 18, 18, 81,
       26, 41, 39, 23,  2, 27,  2, 60, 58, 26, 39, 33, 60, 26, 39, 27, 27,
       26, 39, 27,  2,  8, 60, 80, 26, 60, 69, 60, 33, 81, 26, 59,  6,  0,
       39, 18, 18, 81, 26, 41, 39, 23,  2, 27, 81, 26,  2, 58, 26, 59,  6,
        0, 39, 18, 18, 81, 26,  2,  6, 26,  2, 61, 58, 26, 32, 79,  6,  1,
       79, 39, 81, 25,  1,  1, 65, 69, 60, 33, 81, 61,  0,  2,  6])

Since the network works with individual characters, this is similar to a classification problem in which we try to predict the next character from the previous text. Here's how many 'classes' our network has to pick from.

In [6]:
len(vocab)

83

## Making training mini-batches

Here is where we'll make our mini-batches for training. Remember that we want our batches to be multiple sequences of some desired number of sequence steps. Considering a simple example, our batches would look like this:

<img src="assets/sequence_batching@1x.png" width=500px>


<br>
We have our text encoded as integers as one long array in `encoded`. Let's create a function that will give us an iterator for our batches. I like using [generator functions](https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/) to do this. Then we can pass `encoded` into this function and get our batch generator.

The first thing we need to do is discard some of the text so we only have completely full batches. Each batch contains $N \times M$ characters, where $N$ is the batch size (the number of sequences) and $M$ is the number of steps. To get the number of batches we can make from some array `arr`, divide the length of `arr` by the number of characters per batch, $N \times M$, and round down. Multiplying the number of batches by the characters per batch then gives the total number of characters to keep.
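
Here is that arithmetic worked through with made-up sizes. The text length of 1234 is hypothetical and chosen only to make the numbers easy to follow.

```python
# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
n_chars = 1234      # length of the encoded text
n_seqs = 10         # N: sequences per batch
n_steps = 50        # M: steps per sequence

chars_per_batch = n_seqs * n_steps          # N * M = 500
n_batches = n_chars // chars_per_batch      # 1234 // 500 = 2
n_keep = n_batches * chars_per_batch        # 2 * 500 = 1000

print(n_batches, n_keep, n_chars - n_keep)  # 2 1000 234
```

The last number is how many trailing characters get discarded.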

After that, we need to split `arr` into $N$ sequences. You can do this using `arr.reshape(size)`, where `size` is a tuple containing the dimension sizes of the reshaped array. We know we want $N$ sequences (`n_seqs` below), so let's make that the size of the first dimension. For the second dimension, you can use `-1` as a placeholder in the size, and NumPy will work out the appropriate length for you. After this, you should have an array that is $N \times (M * K)$, where $K$ is the number of batches.
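
A small sketch of the reshape on a toy array, assuming $N = 2$, $M = 3$, and $K = 2$ (these values are mine, not the notebook's):

```python
import numpy as np

n_seqs, n_steps = 2, 3             # N and M (toy values)
arr_demo = np.arange(12)           # 12 = N * M * K with K = 2 batches

rows = arr_demo.reshape((n_seqs, -1))   # -1 lets NumPy infer M * K = 6
print(rows.shape)                       # (2, 6)
print(rows)
```

Each row is one long sequence, and consecutive characters of the text stay next to each other within a row.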

Now that we have this array, we can iterate through it to get our batches. The idea is each batch is a $N \times M$ window on the array. For each subsequent batch, the window moves over by `n_steps`. We also want to create both the input and target arrays. Remember that the targets are the inputs shifted over one character. You'll usually see the first input character used as the last target character, so something like this:
```python
y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
```
where `x` is the input batch and `y` is the target batch.

The way I like to do this window is use `range` to take steps of size `n_steps` from $0$ to `arr.shape[1]`, the total number of steps in each sequence. That way, the integers you get from `range` always point to the start of a batch, and each window is `n_steps` wide.
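
As a rough illustration of the windowing, here is the same idea on a toy $2 \times 6$ array with `n_steps = 3`. The names `rows`, `starts`, and `windows` are just for this sketch.

```python
import numpy as np

n_seqs, n_steps = 2, 3
rows = np.arange(12).reshape((n_seqs, -1))   # shape (2, 6)

# Each value from range marks the first column of a batch
starts = list(range(0, rows.shape[1], n_steps))
print(starts)                                # [0, 3]

# Each window is n_seqs rows by n_steps columns
windows = [rows[:, n:n + n_steps] for n in starts]
for w in windows:
    print(w)
```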

> **Exercise:** Write the code for creating batches in the function below. The exercises in this notebook _will not be easy_. I've provided a notebook with solutions alongside this notebook. If you get stuck, checkout the solutions. The most important thing is that you don't copy and paste the code into here, **type out the solution code yourself.**

In [7]:
tx = np.array(encoded[:100])
tx = tx.reshape((10, -1))
print(tx)
# Targets are the inputs shifted left by one, with the first column wrapped to the end
ty = np.zeros_like(tx)
ty[:, :-1], ty[:, -1] = tx[:, 1:], tx[:, 0]

[[44  0 39 18 61 60 33 26  7  1]
 [ 1  1 47 39 18 18 81 26 41 39]
 [23  2 27  2 60 58 26 39 33 60]
 [26 39 27 27 26 39 27  2  8 60]
 [80 26 60 69 60 33 81 26 59  6]
 [ 0 39 18 18 81 26 41 39 23  2]
 [27 81 26  2 58 26 59  6  0 39]
 [18 18 81 26  2  6 26  2 61 58]
 [26 32 79  6  1 79 39 81 25  1]
 [ 1 65 69 60 33 81 61  0  2  6]]


In [8]:
def get_batches(arr, n_seqs, n_steps):
    '''Create a generator that returns batches of size
       n_seqs x n_steps from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       n_seqs: Batch size, the number of sequences per batch
       n_steps: Number of sequence steps per batch
    '''
    # Get the number of characters per batch and the number of full batches we can make
    chars_per_batch = n_seqs * n_steps
    n_batches = len(arr) // chars_per_batch
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * chars_per_batch]
    
    # Reshape into n_seqs rows
    arr = arr.reshape((n_seqs, -1))
    
    for n in range(0, arr.shape[1], n_steps):
        # The features
        x = arr[:, n:n+n_steps]
        # The targets, shifted by one with the first input wrapped to the end
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y

Now I'll make my data sets and we can check out what's going on here. Here I'm going to use a batch size of 10 and 50 sequence steps.

In [9]:
batches = get_batches(encoded, 10, 50)
x, y = next(batches)

In [10]:
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[44  0 39 18 61 60 33 26  7  1]
 [26 39 23 26  6 32 61 26 56 32]
 [69  2  6 25  1  1 30 77 60 58]
 [ 6 26 24 59 33  2  6 56 26  0]
 [26  2 61 26  2 58 63 26 58  2]
 [26 46 61 26 79 39 58  1 32  6]
 [ 0 60  6 26 12 32 23 60 26 41]
 [80 26 10 59 61 26  6 32 79 26]
 [61 26  2 58  6 75 61 25 26 62]
 [26 58 39  2 24 26 61 32 26  0]]

y
 [[ 0 39 18 61 60 33 26  7  1  1]
 [39 23 26  6 32 61 26 56 32  2]
 [ 2  6 25  1  1 30 77 60 58 63]
 [26 24 59 33  2  6 56 26  0  2]
 [ 2 61 26  2 58 63 26 58  2 33]
 [46 61 26 79 39 58  1 32  6 27]
 [60  6 26 12 32 23 60 26 41 32]
 [26 10 59 61 26  6 32 79 26 58]
 [26  2 58  6 75 61 25 26 62  0]
 [58 39  2 24 26 61 32 26  0 60]]


If you implemented `get_batches` correctly, the above output should look something like 
```
x
 [[55 63 69 22  6 76 45  5 16 35]
 [ 5 69  1  5 12 52  6  5 56 52]
 [48 29 12 61 35 35  8 64 76 78]
 [12  5 24 39 45 29 12 56  5 63]
 [ 5 29  6  5 29 78 28  5 78 29]
 [ 5 13  6  5 36 69 78 35 52 12]
 [63 76 12  5 18 52  1 76  5 58]
 [34  5 73 39  6  5 12 52 36  5]
 [ 6  5 29 78 12 79  6 61  5 59]
 [ 5 78 69 29 24  5  6 52  5 63]]

y
 [[63 69 22  6 76 45  5 16 35 35]
 [69  1  5 12 52  6  5 56 52 29]
 [29 12 61 35 35  8 64 76 78 28]
 [ 5 24 39 45 29 12 56  5 63 29]
 [29  6  5 29 78 28  5 78 29 45]
 [13  6  5 36 69 78 35 52 12 43]
 [76 12  5 18 52  1 76  5 58 52]
 [ 5 73 39  6  5 12 52 36  5 78]
 [ 5 29 78 12 79  6 61  5 59 63]
 [78 69 29 24  5  6 52  5 63 76]]
```
although the exact numbers will be different. Check to make sure the data is shifted over one step for `y`.
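
If you'd rather check the shift programmatically than by eye, something like the following could work. `is_shifted_by_one` is a helper I'm making up here, and the toy arrays stand in for a full batch from `get_batches`.

```python
import numpy as np

def is_shifted_by_one(x, y):
    """Return True if y equals x shifted left by one step, with the
    first column of x wrapped around into the last column of y."""
    return (np.array_equal(y[:, :-1], x[:, 1:])
            and np.array_equal(y[:, -1], x[:, 0]))

# Toy batch standing in for the output of get_batches
x_demo = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8]])
y_demo = np.zeros_like(x_demo)
y_demo[:, :-1], y_demo[:, -1] = x_demo[:, 1:], x_demo[:, 0]

print(is_shifted_by_one(x_demo, y_demo))   # True
print(is_shifted_by_one(x_demo, x_demo))   # False
```

Note that this check only makes sense on a whole batch. The printout above shows just the first 10 of 50 columns, so its last displayed column of `y` is not the wrapped one.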

## Building the model

Below is where you'll build the network. We'll break it up into parts so it's easier to reason about each bit. Then we can connect them up into the whole network.

<img src="assets/charRNN.png" width=500px>


### Inputs

First off we'll create our input placeholders. As usual we need placeholders for the training data and the targets. We'll also create a placeholder for dropout layers called `keep_prob`. This will be a scalar, that is a 0-D tensor. To make a scalar, you create a placeholder without giving it a size.

> **Exercise:** Create the input placeholders in the function below.

In [11]:
def build_inputs(batch_size, num_steps):
    ''' Define placeholders for inputs, targets, and dropout 
    
        Arguments
        ---------
        batch_size: Batch size, number of sequences per batch
        num_steps: Number of sequence steps in a batch
        
    '''
    # Declare placeholders we'll feed into the graph
    inputs = tf.placeholder(tf.int32, [batch_size, num_steps], name='inputs')
    targets = tf.placeholder(tf.int32, [batch_size, num_steps], name='targets')
    
    # Keep probability placeholder for drop out layers
    keep_prob = tf.placeholder(tf.float32, name = 'keep_prob')
    
    return inputs, targets, keep_prob

### LSTM Cell

Here we will create the LSTM cell we'll use in the hidden layer. We'll use this cell as a building block for the RNN. So we aren't actually defining the RNN here, just the type of cell we'll use in the hidden layer.

We first create a basic LSTM cell with

```python
lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
```

where `num_units` is the number of units in the hidden layers in the cell. Then we can add dropout by wrapping it with 

```python
tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
```
You pass in a cell and it will automatically add dropout to the inputs or outputs. Finally, we can stack up the LSTM cells into layers with [`tf.contrib.rnn.MultiRNNCell`](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/contrib/rnn/MultiRNNCell). With this, you pass in a list of cells and it will send the output of one cell into the next cell. For example,

```python
tf.contrib.rnn.MultiRNNCell([cell]*num_layers)
```

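This might look a little weird if you know Python well, because it creates a list that holds the same `cell` object `num_layers` times. In TensorFlow 1.0 this worked, and TensorFlow still created different weight matrices for each layer. In later 1.x releases, reusing one cell object this way no longer gives independent layers (it can raise an error or share weights), so there you need to build a separate cell for each layer, for example with a list comprehension as in the `build_lstm` code below. Even though this is actually multiple LSTM cells stacked on each other, you can treat the multiple layers as one cell.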
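
The reason `[cell]*num_layers` looks odd is plain Python behavior and has nothing to do with TensorFlow. Here is a minimal sketch using a made-up `ToyCell` class (not a TensorFlow class) to show the difference between repeating one object and constructing one per layer.

```python
class ToyCell:
    """Stand-in for an RNN cell object; not a TensorFlow class."""
    def __init__(self, size):
        self.size = size

num_layers = 3

# List multiplication repeats a reference to one object...
shared = [ToyCell(128)] * num_layers
# ...while a comprehension calls the constructor once per layer.
separate = [ToyCell(128) for _ in range(num_layers)]

print(all(c is shared[0] for c in shared))   # True
print(len({id(c) for c in separate}))        # 3
```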

We also need to create an initial cell state of all zeros. This can be done like so

```python
initial_state = cell.zero_state(batch_size, tf.float32)
```

> **Exercise:** Below, implement the `build_lstm` function to create these LSTM cells and the initial state.

In [12]:
def build_lstm(lstm_size, num_layers, batch_size, keep_prob):
    ''' Build LSTM cell.
    
        Arguments
        ---------
        keep_prob: Scalar tensor (tf.placeholder) for the dropout keep probability
        lstm_size: Size of the hidden layers in the LSTM cells
        num_layers: Number of LSTM layers
        batch_size: Batch size

    '''
    ### Build the LSTM Cell
    def build_cell():
        # Use a basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        # Add dropout to the cell outputs
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    
    # Stack up multiple LSTM layers, for deep learning.
    # Each layer gets its own cell object.
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(num_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    return cell, initial_state

### RNN Output

Here we'll create the output layer. We need to connect the output of the RNN cells to a fully connected layer with a softmax output. The softmax output gives us a probability distribution we can use to predict the next character, so we want this layer to have size $C$, the number of classes/characters we have in our text.

If our input has batch size $N$, number of steps $M$, and the hidden layer has $L$ hidden units, then the output is a 3D tensor with size $N \times M \times L$. The output of each LSTM cell has size $L$, we have $M$ of them, one for each sequence step, and we have $N$ sequences. So the total size is $N \times M \times L$. 

We are using the same fully connected layer, the same weights, for each of the outputs. To make things easier, we should reshape the outputs into a 2D tensor with shape $(M * N) \times L$. That is, one row for each sequence and step, where the values of each row are the output from the LSTM cells. The LSTM output comes in as `lstm_output`. If it is a list of tensors, [`tf.concat`](https://www.tensorflow.org/api_docs/python/tf/concat) joins it into one tensor; with `tf.nn.dynamic_rnn`, which we use below, it should already be a single 3D tensor, so the concatenation leaves it unchanged. Then, reshape it (with `tf.reshape`) to size $(M * N) \times L$.

Once we have the outputs reshaped, we can do the matrix multiplication with the weights. We need to wrap the weight and bias variables in a variable scope with `tf.variable_scope(scope_name)` because there are weights being created in the LSTM cells. TensorFlow will throw an error if the weights created here have the same names as the weights created in the LSTM cells, which they will by default. To avoid this, we wrap the variables in a variable scope so we can give them unique names.
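
To make the shapes concrete, here is a NumPy sketch of the reshape and the matrix multiplication. It uses random numbers in place of real LSTM outputs, and the sizes are toy values I picked for the illustration.

```python
import numpy as np

N, M, L, C = 4, 5, 8, 3    # toy sizes: batch, steps, hidden units, classes
rng = np.random.RandomState(0)

lstm_out = rng.randn(N, M, L)            # stand-in for the RNN output
flat = lstm_out.reshape(-1, L)           # (N * M, L): one row per sequence step

softmax_w = rng.randn(L, C) * 0.1        # the same weights are used for every row
softmax_b = np.zeros(C)
logits_demo = np.matmul(flat, softmax_w) + softmax_b   # (N * M, C)

print(flat.shape, logits_demo.shape)     # (20, 8) (20, 3)
```

Row `n * M + m` of the flattened array holds the output for sequence `n` at step `m`, which is the same ordering you need for the reshaped targets in the loss.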

> **Exercise:** Implement the output layer in the function below.

In [13]:
def build_output(lstm_output, in_size, out_size):
    ''' Build a softmax layer, return the softmax output and logits.
    
        Arguments
        ---------
        
        lstm_output: List of output tensors from the LSTM layer
        in_size: Size of the input tensor, for example, size of the LSTM cells
        out_size: Size of this softmax layer
    
    '''

    # Reshape output so it's a bunch of rows, one row for each step for each sequence.
    # Concatenate lstm_output over axis 1 (the columns)
    seq_output = tf.concat(lstm_output, axis=1)
    # Reshape seq_output to a 2D tensor with lstm_size columns
    x = tf.reshape(seq_output, [-1, in_size])
    
    # Connect the RNN outputs to a softmax layer
    with tf.variable_scope('softmax'):
        # Create the weight and bias variables here
        softmax_w = tf.Variable(tf.truncated_normal((in_size, out_size), stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))
    
    # Since output is a bunch of rows of RNN cell outputs, logits will be a bunch
    # of rows of logit outputs, one for each step and sequence
    logits = tf.matmul(x, softmax_w) + softmax_b
    
    # Use softmax to get the probabilities for predicted characters
    out = tf.nn.softmax(logits, name='predictions')
    
    return out, logits

### Training loss

Next up is the training loss. We get the logits and targets and calculate the softmax cross-entropy loss. First we need to one-hot encode the targets, since we get them as integer-encoded characters. Then, reshape the one-hot targets into a 2D tensor with size $(M*N) \times C$, where $C$ is the number of classes/characters we have. Remember that we reshaped the LSTM outputs and ran them through a fully connected layer with $C$ units. So our logits will also have size $(M*N) \times C$.

Then we run the logits and targets through `tf.nn.softmax_cross_entropy_with_logits` and find the mean to get the loss.
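
For intuition, here is a NumPy sketch of the same calculation. It is not the TensorFlow implementation, just the math as I understand it, and `softmax_cross_entropy` is a name I'm using only for this example.

```python
import numpy as np

def softmax_cross_entropy(logits, targets, num_classes):
    """Mean softmax cross-entropy for integer targets (NumPy sketch)."""
    one_hot = np.eye(num_classes)[targets.reshape(-1)]       # (M*N, C)
    shifted = logits - logits.max(axis=1, keepdims=True)     # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot * log_probs).sum(axis=1).mean()

# Uniform logits over C classes should give a loss of log(C)
C = 83
targets_demo = np.array([[0, 5, 82], [7, 7, 1]])   # shape (N, M) = (2, 3)
uniform_logits = np.zeros((targets_demo.size, C))
loss_demo = softmax_cross_entropy(uniform_logits, targets_demo, C)

print(loss_demo)
print(np.log(C))
```

A network that knows nothing yet spreads its probability evenly over all 83 characters, so a first training loss near $\log 83 \approx 4.42$ is a reasonable sanity check.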

>**Exercise:** Implement the loss calculation in the function below.

In [14]:
def build_loss(logits, targets, lstm_size, num_classes):
    ''' Calculate the loss from the logits and the targets.
    
        Arguments
        ---------
        logits: Logits from final fully connected layer
        targets: Targets for supervised learning
        lstm_size: Number of LSTM hidden units
        num_classes: Number of classes in targets
        
    '''
    
    # One-hot encode targets and reshape to match logits, one row per sequence per step
    y_one_hot = tf.one_hot(targets, num_classes)
    y_reshaped =  tf.reshape(y_one_hot, logits.get_shape())
    
    # Softmax cross entropy loss
    loss = tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels=y_reshaped)
    loss = tf.reduce_mean(loss)
    
    return loss

### Optimizer

Here we build the optimizer. Normal RNNs have issues with gradients exploding and vanishing. LSTMs fix the vanishing problem, but the gradients can still grow without bound. To fix this, we can clip the gradients at some threshold. The code below uses `tf.clip_by_global_norm`, which rescales all of the gradients together whenever their combined norm exceeds the threshold, so the combined norm never goes above it. This will ensure the gradients never grow overly large. Then we use an AdamOptimizer for the learning step.
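
Here is a plain-Python sketch of clipping by global norm, the variant `tf.clip_by_global_norm` performs. As I understand the documented behavior, every gradient is scaled by `clip_norm / max(global_norm, clip_norm)`. The function name and the toy gradients are mine.

```python
import math

def clip_by_global_norm_sketch(grads, clip_norm):
    """Rescale a list of gradient lists so their joint L2 norm is at
    most clip_norm. Returns the clipped gradients and the original norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

grads_demo = [[3.0, 4.0], [12.0]]          # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm_sketch(grads_demo, clip_norm=5.0)

print(norm)
print(clipped)
```

Because every gradient is scaled by the same factor, the direction of the overall update is preserved. Gradients whose global norm is already below the threshold pass through unchanged.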

In [15]:
def build_optimizer(loss, learning_rate, grad_clip):
    ''' Build optimizer for training, using gradient clipping.
    
        Arguments:
        loss: Network loss
        learning_rate: Learning rate for optimizer
        grad_clip: Maximum global norm of the gradients
    
    '''
    
    # Optimizer for training, using gradient clipping to control exploding gradients
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    train_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(grads, tvars))
    
    return optimizer

### Build the network

Now we can put all the pieces together and build a class for the network. To actually run data through the LSTM cells, we will use [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/nn/dynamic_rnn). This function will pass the hidden and cell states across LSTM cells appropriately for us. It returns the outputs for each LSTM cell at each step for each sequence in the mini-batch. It also gives us the final LSTM state. We want to save this state as `final_state` so we can pass it to the first LSTM cell in the next mini-batch run. For `tf.nn.dynamic_rnn`, we pass in the cell and initial state we get from `build_lstm`, as well as our input sequences. Also, we need to one-hot encode the inputs before going into the RNN.
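
If one-hot encoding a whole batch is hard to picture, here is a NumPy sketch of what it does to the shapes. The `one_hot` helper and the toy inputs are assumptions for this illustration and are not part of the notebook's code.

```python
import numpy as np

def one_hot(indices, num_classes):
    """One-hot encode an integer array, adding a trailing class axis."""
    return np.eye(num_classes, dtype=np.float32)[indices]

inputs_demo = np.array([[0, 2, 1],
                        [3, 3, 0]])        # shape (N, M) = (2, 3)
x_one_hot_demo = one_hot(inputs_demo, num_classes=4)

print(x_one_hot_demo.shape)                # (2, 3, 4)
print(x_one_hot_demo[0, 1])                # [0. 0. 1. 0.]
```

So an $N \times M$ batch of integers becomes an $N \times M \times C$ tensor, which is the input the RNN sees at each step.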

> **Exercise:** Use the functions you've implemented previously and `tf.nn.dynamic_rnn` to build the network.

In [16]:
class CharRNN:
    
    def __init__(self, num_classes, batch_size=64, num_steps=50, 
                       lstm_size=128, num_layers=2, learning_rate=0.001, 
                       grad_clip=5, sampling=False):
    
        # When we're using this network for sampling later, we'll be passing in
        # one character at a time, so providing an option for that
        if sampling:
            batch_size, num_steps = 1, 1

        tf.reset_default_graph()
        
        # Build the input placeholder tensors
        self.inputs, self.targets, self.keep_prob = build_inputs(batch_size, num_steps)

        # Build the LSTM cell
        cell, self.initial_state = build_lstm(lstm_size, num_layers, batch_size, self.keep_prob)

        ### Run the data through the RNN layers
        # First, one-hot encode the input tokens
        x_one_hot = tf.one_hot(self.inputs, num_classes)
        
        # Run each sequence step through the RNN with tf.nn.dynamic_rnn 
        outputs, state = tf.nn.dynamic_rnn(cell, x_one_hot, initial_state = self.initial_state)
        self.final_state = state
        
        # Get softmax predictions and logits
        self.prediction, self.logits = build_output(outputs, lstm_size, num_classes)
        
        # Loss and optimizer (with gradient clipping)
        self.loss =  build_loss(self.logits, self.targets, lstm_size, num_classes)
        self.optimizer = build_optimizer(self.loss, learning_rate, grad_clip)

## Hyperparameters

Here are the hyperparameters for the network.

* `batch_size` - Number of sequences running through the network in one pass.
* `num_steps` - Number of characters in the sequence the network is trained on. Larger is typically better, since the network can learn longer-range dependencies, but it takes longer to train. 100 is typically a good number here.
* `lstm_size` - The number of units in the hidden layers.
* `num_layers` - Number of hidden LSTM layers to use.
* `learning_rate` - Learning rate for training.
* `keep_prob` - The dropout keep probability when training. If your network is overfitting, try decreasing this.

Here's some good advice from Andrej Karpathy on training the network. I'm going to copy it in here for your benefit, but also link to [where it originally came from](https://github.com/karpathy/char-rnn#tips-and-tricks).

> ## Tips and Tricks

>### Monitoring Validation Loss vs. Training Loss
>If you're somewhat new to Machine Learning or Neural Networks it can take a bit of expertise to get good models. The most important quantity to keep track of is the difference between your training loss (printed during training) and the validation loss (printed once in a while when the RNN is run on the validation data (by default every 1000 iterations)). In particular:

> - If your training loss is much lower than validation loss then this means the network might be **overfitting**. Solutions to this are to decrease your network size, or to increase dropout. For example you could try dropout of 0.5 and so on.
> - If your training/validation loss are about equal then your model is **underfitting**. Increase the size of your model (either number of layers or the raw number of neurons per layer)

> ### Approximate number of parameters

> The two most important parameters that control the model are `lstm_size` and `num_layers`. I would advise that you always use `num_layers` of either 2/3. The `lstm_size` can be adjusted based on how much data you have. The two important quantities to keep track of here are:

> - The number of parameters in your model. This is printed when you start training.
> - The size of your dataset. 1MB file is approximately 1 million characters.

>These two should be about the same order of magnitude. It's a little tricky to tell. Here are some examples:

> - I have a 100MB dataset and I'm using the default parameter settings (which currently print 150K parameters). My data size is significantly larger (100 mil >> 0.15 mil), so I expect to heavily underfit. I am thinking I can comfortably afford to make `lstm_size` larger.
> - I have a 10MB dataset and running a 10 million parameter model. I'm slightly nervous and I'm carefully monitoring my validation loss. If it's larger than my training loss then I may want to try to increase dropout a bit and see if that helps the validation loss.

> ### Best models strategy

>The winning strategy to obtaining very good models (if you have the compute time) is to always err on making the network larger (as large as you're willing to wait for it to compute) and then try different dropout values (between 0,1). Whatever model has the best validation performance (the loss, written in the checkpoint filename, low is good) is the one you should use in the end.

>It is very common in deep learning to run many different models with many different hyperparameter settings, and in the end take whatever checkpoint gave the best validation performance.

>By the way, the size of your training and validation splits are also parameters. Make sure you have a decent amount of data in your validation set or otherwise the validation performance will be noisy and not very informative.

In [17]:
batch_size = 10         # Sequences per batch
num_steps = 100          # Number of sequence steps per batch
lstm_size = 512         # Size of hidden layers in LSTMs
num_layers = 2          # Number of LSTM layers
learning_rate = 0.001    # Learning rate
keep_prob = 0.5         # Dropout keep probability
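
Karpathy's advice compares the number of parameters with the size of the dataset, but this notebook doesn't print a parameter count. Here is a rough estimate, assuming a standard LSTM layer with four gates and no peephole connections (which I believe matches `BasicLSTMCell`) plus the softmax layer. The function names are mine.

```python
def lstm_layer_params(input_size, hidden_size):
    """Weights and biases for the four gates of a basic LSTM layer."""
    return 4 * (input_size + hidden_size + 1) * hidden_size

def char_rnn_params(num_classes, lstm_size, num_layers):
    """Approximate trainable parameter count for the stacked LSTM plus softmax."""
    # The first layer sees the one-hot input of size num_classes
    total = lstm_layer_params(num_classes, lstm_size)
    # Every later layer sees the previous layer's output of size lstm_size
    total += (num_layers - 1) * lstm_layer_params(lstm_size, lstm_size)
    # Softmax weights and biases
    total += lstm_size * num_classes + num_classes
    return total

print(char_rnn_params(num_classes=83, lstm_size=512, num_layers=2))  # 3362387
```

With the settings above that is roughly 3.4 million parameters, which you can compare against the number of characters in your text file.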

## Time for training

This is typical training code, passing inputs and targets into the network, then running the optimizer. Here we also get back the final LSTM state for the mini-batch. Then, we pass that state back into the network so the next batch can continue the state from the previous batch. And every so often (set by `save_every_n`) I save a checkpoint.

Here I'm saving checkpoints with the format

`i{iteration number}_l{# hidden layer units}.ckpt`
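
As a quick illustration of that naming scheme, here is a tiny helper that builds the path. `checkpoint_name` is a hypothetical function, not something the training code below defines.

```python
def checkpoint_name(counter, lstm_size, directory='checkpoints'):
    """Build a checkpoint path in the i{iteration}_l{units}.ckpt format."""
    return '{}/i{}_l{}.ckpt'.format(directory, counter, lstm_size)

print(checkpoint_name(200, 512))   # checkpoints/i200_l512.ckpt
```

If the `checkpoints` directory doesn't exist yet, creating it first (for example with `os.makedirs('checkpoints', exist_ok=True)`) may save you from an error the first time the saver runs.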

> **Exercise:** Set the hyperparameters above to train the network. Watch the training loss, it should be consistently dropping. Also, I highly advise running this on a GPU.

In [18]:
epochs = 5
# Save every N iterations
save_every_n = 200

model = CharRNN(len(vocab), batch_size=batch_size, num_steps=num_steps,
                lstm_size=lstm_size, num_layers=num_layers, 
                learning_rate=learning_rate)

saver = tf.train.Saver(max_to_keep=100)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    # Use the line below to load a checkpoint and resume training
    #saver.restore(sess, 'checkpoints/______.ckpt')
    counter = 0
    for e in range(epochs):
        # Train network
        new_state = sess.run(model.initial_state)
        loss = 0
        for x, y in get_batches(encoded, batch_size, num_steps):
            counter += 1
            start = time.time()
            feed = {model.inputs: x,
                    model.targets: y,
                    model.keep_prob: keep_prob,
                    model.initial_state: new_state}
            batch_loss, new_state, _ = sess.run([model.loss, 
                                                 model.final_state, 
                                                 model.optimizer], 
                                                 feed_dict=feed)
            
            end = time.time()
            print('Epoch: {}/{}... '.format(e+1, epochs),
                  'Training Step: {}... '.format(counter),
                  'Training loss: {:.4f}... '.format(batch_loss),
                  '{:.4f} sec/batch'.format((end-start)))
        
            if (counter % save_every_n == 0):
                saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))
    
    saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))

Epoch: 1/5...  Training Step: 1...  Training loss: 4.4167...  1.4562 sec/batch
Epoch: 1/5...  Training Step: 2...  Training loss: 4.3206...  1.4905 sec/batch
Epoch: 1/5...  Training Step: 3...  Training loss: 3.8235...  1.6609 sec/batch
Epoch: 1/5...  Training Step: 4...  Training loss: 5.2638...  1.6548 sec/batch
Epoch: 1/5...  Training Step: 5...  Training loss: 4.2985...  1.6026 sec/batch
Epoch: 1/5...  Training Step: 6...  Training loss: 3.8881...  1.6786 sec/batch
Epoch: 1/5...  Training Step: 7...  Training loss: 3.8695...  1.7779 sec/batch
Epoch: 1/5...  Training Step: 8...  Training loss: 3.5358...  1.6801 sec/batch
Epoch: 1/5...  Training Step: 9...  Training loss: 3.5166...  1.7056 sec/batch
Epoch: 1/5...  Training Step: 10...  Training loss: 3.4921...  1.7035 sec/batch
Epoch: 1/5...  Training Step: 11...  Training loss: 3.4714...  1.7978 sec/batch
Epoch: 1/5...  Training Step: 12...  Training loss: 3.4594...  1.8163 sec/batch
...

Epoch: 1/5...  Training Step: 104...  Training loss: 3.1273...  1.9669 sec/batch
Epoch: 1/5...  Training Step: 105...  Training loss: 3.0883...  1.8360 sec/batch
Epoch: 1/5...  Training Step: 106...  Training loss: 3.1022...  1.8706 sec/batch
Epoch: 1/5...  Training Step: 107...  Training loss: 3.1158...  1.9374 sec/batch
Epoch: 1/5...  Training Step: 108...  Training loss: 3.1339...  1.8009 sec/batch
Epoch: 1/5...  Training Step: 109...  Training loss: 3.0825...  1.8187 sec/batch
Epoch: 1/5...  Training Step: 110...  Training loss: 3.0790...  1.7736 sec/batch
Epoch: 1/5...  Training Step: 111...  Training loss: 3.1293...  1.8110 sec/batch
Epoch: 1/5...  Training Step: 112...  Training loss: 3.0854...  1.8051 sec/batch
Epoch: 1/5...  Training Step: 113...  Training loss: 3.1742...  1.8758 sec/batch
Epoch: 1/5...  Training Step: 114...  Training loss: 3.1485...  1.8881 sec/batch
Epoch: 1/5...  Training Step: 115...  Training loss: 3.0982...  1.8165 sec/batch
...

Epoch: 1/5...  Training Step: 206...  Training loss: 2.6647...  1.8167 sec/batch
Epoch: 1/5...  Training Step: 207...  Training loss: 2.6512...  1.8001 sec/batch
Epoch: 1/5...  Training Step: 208...  Training loss: 2.5393...  1.8799 sec/batch
Epoch: 1/5...  Training Step: 209...  Training loss: 2.5650...  1.8250 sec/batch
Epoch: 1/5...  Training Step: 210...  Training loss: 2.5615...  1.9232 sec/batch
Epoch: 1/5...  Training Step: 211...  Training loss: 2.6353...  1.7876 sec/batch
Epoch: 1/5...  Training Step: 212...  Training loss: 2.6206...  1.7793 sec/batch
Epoch: 1/5...  Training Step: 213...  Training loss: 2.5986...  1.8468 sec/batch
Epoch: 1/5...  Training Step: 214...  Training loss: 2.5213...  1.7827 sec/batch
Epoch: 1/5...  Training Step: 215...  Training loss: 2.5511...  1.7901 sec/batch
Epoch: 1/5...  Training Step: 216...  Training loss: 2.5269...  1.8108 sec/batch
Epoch: 1/5...  Training Step: 217...  Training loss: 2.5442...  1.8879 sec/batch
...

Epoch: 1/5...  Training Step: 308...  Training loss: 2.3484...  1.8702 sec/batch
Epoch: 1/5...  Training Step: 309...  Training loss: 2.4047...  1.8592 sec/batch
Epoch: 1/5...  Training Step: 310...  Training loss: 2.3248...  1.8725 sec/batch
Epoch: 1/5...  Training Step: 311...  Training loss: 2.3361...  1.8602 sec/batch
Epoch: 1/5...  Training Step: 312...  Training loss: 2.3857...  1.8263 sec/batch
Epoch: 1/5...  Training Step: 313...  Training loss: 2.3404...  1.8336 sec/batch
Epoch: 1/5...  Training Step: 314...  Training loss: 2.3431...  1.7918 sec/batch
Epoch: 1/5...  Training Step: 315...  Training loss: 2.4768...  1.7932 sec/batch
Epoch: 1/5...  Training Step: 316...  Training loss: 2.3484...  1.8310 sec/batch
Epoch: 1/5...  Training Step: 317...  Training loss: 2.4213...  1.8969 sec/batch
Epoch: 1/5...  Training Step: 318...  Training loss: 2.4690...  1.8375 sec/batch
Epoch: 1/5...  Training Step: 319...  Training loss: 2.3698...  1.8066 sec/batch
...

Epoch: 1/5...  Training Step: 410...  Training loss: 2.2941...  1.8526 sec/batch
Epoch: 1/5...  Training Step: 411...  Training loss: 2.2587...  1.8350 sec/batch
Epoch: 1/5...  Training Step: 412...  Training loss: 2.2037...  1.8117 sec/batch
Epoch: 1/5...  Training Step: 413...  Training loss: 2.2692...  1.8665 sec/batch
Epoch: 1/5...  Training Step: 414...  Training loss: 2.2678...  1.8277 sec/batch
Epoch: 1/5...  Training Step: 415...  Training loss: 2.2577...  1.8187 sec/batch
Epoch: 1/5...  Training Step: 416...  Training loss: 2.2790...  1.8179 sec/batch
Epoch: 1/5...  Training Step: 417...  Training loss: 2.2702...  1.8799 sec/batch
Epoch: 1/5...  Training Step: 418...  Training loss: 2.2981...  1.7914 sec/batch
Epoch: 1/5...  Training Step: 419...  Training loss: 2.2235...  1.8179 sec/batch
Epoch: 1/5...  Training Step: 420...  Training loss: 2.2731...  1.8137 sec/batch
Epoch: 1/5...  Training Step: 421...  Training loss: 2.2642...  1.8159 sec/batch
...

Epoch: 1/5...  Training Step: 512...  Training loss: 2.1559...  1.8267 sec/batch
Epoch: 1/5...  Training Step: 513...  Training loss: 2.1886...  1.8226 sec/batch
Epoch: 1/5...  Training Step: 514...  Training loss: 2.1076...  1.7711 sec/batch
Epoch: 1/5...  Training Step: 515...  Training loss: 2.1760...  1.7918 sec/batch
Epoch: 1/5...  Training Step: 516...  Training loss: 2.0856...  1.8159 sec/batch
Epoch: 1/5...  Training Step: 517...  Training loss: 2.0935...  1.8778 sec/batch
Epoch: 1/5...  Training Step: 518...  Training loss: 2.2037...  1.8955 sec/batch
Epoch: 1/5...  Training Step: 519...  Training loss: 2.0620...  1.8135 sec/batch
Epoch: 1/5...  Training Step: 520...  Training loss: 2.1614...  1.8320 sec/batch
Epoch: 1/5...  Training Step: 521...  Training loss: 2.1476...  1.8214 sec/batch
Epoch: 1/5...  Training Step: 522...  Training loss: 2.2251...  1.8115 sec/batch
Epoch: 1/5...  Training Step: 523...  Training loss: 2.2595...  1.9493 sec/batch

Epoch: 1/5...  Training Step: 614...  Training loss: 2.0811...  1.9121 sec/batch
Epoch: 1/5...  Training Step: 615...  Training loss: 2.0887...  1.8741 sec/batch
Epoch: 1/5...  Training Step: 616...  Training loss: 2.0452...  1.9019 sec/batch
Epoch: 1/5...  Training Step: 617...  Training loss: 2.0983...  1.9709 sec/batch
Epoch: 1/5...  Training Step: 618...  Training loss: 2.0685...  2.0336 sec/batch
Epoch: 1/5...  Training Step: 619...  Training loss: 2.0138...  1.9372 sec/batch
Epoch: 1/5...  Training Step: 620...  Training loss: 1.9848...  2.0743 sec/batch
Epoch: 1/5...  Training Step: 621...  Training loss: 1.9638...  1.9557 sec/batch
Epoch: 1/5...  Training Step: 622...  Training loss: 2.0139...  1.9387 sec/batch
Epoch: 1/5...  Training Step: 623...  Training loss: 2.0429...  2.1929 sec/batch
Epoch: 1/5...  Training Step: 624...  Training loss: 2.0715...  1.9066 sec/batch
Epoch: 1/5...  Training Step: 625...  Training loss: 2.0117...  1.8689 sec/batch

Epoch: 1/5...  Training Step: 716...  Training loss: 1.9961...  1.8327 sec/batch
Epoch: 1/5...  Training Step: 717...  Training loss: 2.0878...  1.8444 sec/batch
Epoch: 1/5...  Training Step: 718...  Training loss: 2.0148...  1.8079 sec/batch
Epoch: 1/5...  Training Step: 719...  Training loss: 2.0243...  1.8220 sec/batch
Epoch: 1/5...  Training Step: 720...  Training loss: 2.0700...  1.8372 sec/batch
Epoch: 1/5...  Training Step: 721...  Training loss: 2.0055...  1.8339 sec/batch
Epoch: 1/5...  Training Step: 722...  Training loss: 2.0861...  1.7958 sec/batch
Epoch: 1/5...  Training Step: 723...  Training loss: 2.1205...  1.8001 sec/batch
Epoch: 1/5...  Training Step: 724...  Training loss: 2.1489...  1.8541 sec/batch
Epoch: 1/5...  Training Step: 725...  Training loss: 2.0098...  1.8528 sec/batch
Epoch: 1/5...  Training Step: 726...  Training loss: 1.9724...  1.8440 sec/batch
Epoch: 1/5...  Training Step: 727...  Training loss: 2.1350...  1.8141 sec/batch

Epoch: 1/5...  Training Step: 818...  Training loss: 1.9885...  1.7928 sec/batch
Epoch: 1/5...  Training Step: 819...  Training loss: 1.9177...  1.8029 sec/batch
Epoch: 1/5...  Training Step: 820...  Training loss: 1.9611...  1.8231 sec/batch
Epoch: 1/5...  Training Step: 821...  Training loss: 1.8787...  1.8038 sec/batch
Epoch: 1/5...  Training Step: 822...  Training loss: 1.8723...  1.7828 sec/batch
Epoch: 1/5...  Training Step: 823...  Training loss: 1.9025...  1.7553 sec/batch
Epoch: 1/5...  Training Step: 824...  Training loss: 1.9487...  1.8401 sec/batch
Epoch: 1/5...  Training Step: 825...  Training loss: 1.8998...  1.8452 sec/batch
Epoch: 1/5...  Training Step: 826...  Training loss: 1.9791...  1.7897 sec/batch
Epoch: 1/5...  Training Step: 827...  Training loss: 1.9560...  1.8542 sec/batch
Epoch: 1/5...  Training Step: 828...  Training loss: 1.8666...  1.8370 sec/batch
Epoch: 1/5...  Training Step: 829...  Training loss: 1.9984...  1.8961 sec/batch

Epoch: 1/5...  Training Step: 920...  Training loss: 1.8613...  1.8337 sec/batch
Epoch: 1/5...  Training Step: 921...  Training loss: 1.9408...  1.8179 sec/batch
Epoch: 1/5...  Training Step: 922...  Training loss: 1.7540...  1.8929 sec/batch
Epoch: 1/5...  Training Step: 923...  Training loss: 1.9723...  1.8586 sec/batch
Epoch: 1/5...  Training Step: 924...  Training loss: 1.9277...  1.8763 sec/batch
Epoch: 1/5...  Training Step: 925...  Training loss: 1.8983...  1.8510 sec/batch
Epoch: 1/5...  Training Step: 926...  Training loss: 1.9561...  1.8041 sec/batch
Epoch: 1/5...  Training Step: 927...  Training loss: 2.0412...  1.8151 sec/batch
Epoch: 1/5...  Training Step: 928...  Training loss: 1.9074...  1.7979 sec/batch
Epoch: 1/5...  Training Step: 929...  Training loss: 2.0691...  1.7961 sec/batch
Epoch: 1/5...  Training Step: 930...  Training loss: 1.9100...  1.8492 sec/batch
Epoch: 1/5...  Training Step: 931...  Training loss: 1.9572...  1.8096 sec/batch

Epoch: 1/5...  Training Step: 1021...  Training loss: 1.8287...  1.7918 sec/batch
Epoch: 1/5...  Training Step: 1022...  Training loss: 1.8456...  1.8701 sec/batch
Epoch: 1/5...  Training Step: 1023...  Training loss: 1.8943...  1.8147 sec/batch
Epoch: 1/5...  Training Step: 1024...  Training loss: 1.8327...  1.8358 sec/batch
Epoch: 1/5...  Training Step: 1025...  Training loss: 1.9390...  1.7989 sec/batch
Epoch: 1/5...  Training Step: 1026...  Training loss: 1.8692...  1.7884 sec/batch
Epoch: 1/5...  Training Step: 1027...  Training loss: 1.7853...  1.8352 sec/batch
Epoch: 1/5...  Training Step: 1028...  Training loss: 1.8225...  1.8080 sec/batch
Epoch: 1/5...  Training Step: 1029...  Training loss: 1.7815...  1.8039 sec/batch
Epoch: 1/5...  Training Step: 1030...  Training loss: 1.8042...  1.9525 sec/batch
Epoch: 1/5...  Training Step: 1031...  Training loss: 1.8265...  1.8809 sec/batch
Epoch: 1/5...  Training Step: 1032...  Training loss: 1.8853...  1.8331 sec/batch

Epoch: 1/5...  Training Step: 1121...  Training loss: 1.9230...  1.8793 sec/batch
Epoch: 1/5...  Training Step: 1122...  Training loss: 1.8484...  1.8440 sec/batch
Epoch: 1/5...  Training Step: 1123...  Training loss: 2.0810...  1.8116 sec/batch
Epoch: 1/5...  Training Step: 1124...  Training loss: 1.8090...  1.8960 sec/batch
Epoch: 1/5...  Training Step: 1125...  Training loss: 1.7697...  1.9008 sec/batch
Epoch: 1/5...  Training Step: 1126...  Training loss: 1.7751...  1.8770 sec/batch
Epoch: 1/5...  Training Step: 1127...  Training loss: 1.6949...  1.8180 sec/batch
Epoch: 1/5...  Training Step: 1128...  Training loss: 1.8267...  1.8102 sec/batch
Epoch: 1/5...  Training Step: 1129...  Training loss: 1.8522...  1.8102 sec/batch
Epoch: 1/5...  Training Step: 1130...  Training loss: 1.8662...  1.8299 sec/batch
Epoch: 1/5...  Training Step: 1131...  Training loss: 1.8322...  1.8374 sec/batch
Epoch: 1/5...  Training Step: 1132...  Training loss: 1.8628...  1.8047 sec/batch

Epoch: 1/5...  Training Step: 1221...  Training loss: 1.7566...  1.7673 sec/batch
Epoch: 1/5...  Training Step: 1222...  Training loss: 1.7990...  1.8021 sec/batch
Epoch: 1/5...  Training Step: 1223...  Training loss: 1.7341...  1.8372 sec/batch
Epoch: 1/5...  Training Step: 1224...  Training loss: 1.7442...  1.8999 sec/batch
Epoch: 1/5...  Training Step: 1225...  Training loss: 1.8726...  1.8069 sec/batch
Epoch: 1/5...  Training Step: 1226...  Training loss: 1.8528...  1.8063 sec/batch
Epoch: 1/5...  Training Step: 1227...  Training loss: 1.7870...  1.7853 sec/batch
Epoch: 1/5...  Training Step: 1228...  Training loss: 1.7835...  1.7953 sec/batch
Epoch: 1/5...  Training Step: 1229...  Training loss: 1.9172...  1.7939 sec/batch
Epoch: 1/5...  Training Step: 1230...  Training loss: 1.9143...  1.7909 sec/batch
Epoch: 1/5...  Training Step: 1231...  Training loss: 1.7399...  1.9512 sec/batch
Epoch: 1/5...  Training Step: 1232...  Training loss: 1.6398...  1.9054 sec/batch

Epoch: 1/5...  Training Step: 1321...  Training loss: 1.7465...  1.7368 sec/batch
Epoch: 1/5...  Training Step: 1322...  Training loss: 1.6842...  1.7760 sec/batch
Epoch: 1/5...  Training Step: 1323...  Training loss: 1.8322...  1.8022 sec/batch
Epoch: 1/5...  Training Step: 1324...  Training loss: 1.7698...  1.9548 sec/batch
Epoch: 1/5...  Training Step: 1325...  Training loss: 1.7998...  1.9380 sec/batch
Epoch: 1/5...  Training Step: 1326...  Training loss: 1.7746...  1.8646 sec/batch
Epoch: 1/5...  Training Step: 1327...  Training loss: 1.8293...  1.7828 sec/batch
Epoch: 1/5...  Training Step: 1328...  Training loss: 1.7871...  1.7543 sec/batch
Epoch: 1/5...  Training Step: 1329...  Training loss: 1.7481...  1.7578 sec/batch
Epoch: 1/5...  Training Step: 1330...  Training loss: 1.7585...  1.8587 sec/batch
Epoch: 1/5...  Training Step: 1331...  Training loss: 1.6960...  1.9584 sec/batch
Epoch: 1/5...  Training Step: 1332...  Training loss: 1.7004...  1.8089 sec/batch

Epoch: 1/5...  Training Step: 1421...  Training loss: 1.8234...  1.8241 sec/batch
Epoch: 1/5...  Training Step: 1422...  Training loss: 1.5926...  1.7788 sec/batch
Epoch: 1/5...  Training Step: 1423...  Training loss: 1.7605...  1.8100 sec/batch
Epoch: 1/5...  Training Step: 1424...  Training loss: 1.7365...  1.9295 sec/batch
Epoch: 1/5...  Training Step: 1425...  Training loss: 1.7957...  1.8528 sec/batch
Epoch: 1/5...  Training Step: 1426...  Training loss: 1.8262...  1.8630 sec/batch
Epoch: 1/5...  Training Step: 1427...  Training loss: 1.7512...  1.8509 sec/batch
Epoch: 1/5...  Training Step: 1428...  Training loss: 1.7579...  1.8766 sec/batch
Epoch: 1/5...  Training Step: 1429...  Training loss: 1.7645...  1.9228 sec/batch
Epoch: 1/5...  Training Step: 1430...  Training loss: 1.6471...  1.8885 sec/batch
Epoch: 1/5...  Training Step: 1431...  Training loss: 1.6823...  1.8645 sec/batch
Epoch: 1/5...  Training Step: 1432...  Training loss: 1.7244...  1.7698 sec/batch

Epoch: 1/5...  Training Step: 1521...  Training loss: 1.6699...  1.8594 sec/batch
Epoch: 1/5...  Training Step: 1522...  Training loss: 1.7075...  1.8554 sec/batch
Epoch: 1/5...  Training Step: 1523...  Training loss: 1.7880...  1.9681 sec/batch
Epoch: 1/5...  Training Step: 1524...  Training loss: 1.7304...  1.7920 sec/batch
Epoch: 1/5...  Training Step: 1525...  Training loss: 1.8108...  1.8248 sec/batch
Epoch: 1/5...  Training Step: 1526...  Training loss: 1.7734...  1.8429 sec/batch
Epoch: 1/5...  Training Step: 1527...  Training loss: 1.8579...  1.8466 sec/batch
Epoch: 1/5...  Training Step: 1528...  Training loss: 1.6637...  1.8562 sec/batch
Epoch: 1/5...  Training Step: 1529...  Training loss: 1.7830...  1.8227 sec/batch
Epoch: 1/5...  Training Step: 1530...  Training loss: 1.7281...  1.8319 sec/batch
Epoch: 1/5...  Training Step: 1531...  Training loss: 1.7706...  1.8320 sec/batch
Epoch: 1/5...  Training Step: 1532...  Training loss: 1.7155...  1.8640 sec/batch

Epoch: 1/5...  Training Step: 1621...  Training loss: 1.6700...  1.8388 sec/batch
Epoch: 1/5...  Training Step: 1622...  Training loss: 1.7179...  1.8650 sec/batch
Epoch: 1/5...  Training Step: 1623...  Training loss: 1.7083...  1.8778 sec/batch
Epoch: 1/5...  Training Step: 1624...  Training loss: 1.8007...  1.9319 sec/batch
Epoch: 1/5...  Training Step: 1625...  Training loss: 1.7827...  1.8888 sec/batch
Epoch: 1/5...  Training Step: 1626...  Training loss: 1.7418...  1.8569 sec/batch
Epoch: 1/5...  Training Step: 1627...  Training loss: 1.8271...  1.7960 sec/batch
Epoch: 1/5...  Training Step: 1628...  Training loss: 1.6735...  1.8913 sec/batch
Epoch: 1/5...  Training Step: 1629...  Training loss: 1.6242...  1.7700 sec/batch
Epoch: 1/5...  Training Step: 1630...  Training loss: 1.7189...  1.7617 sec/batch
Epoch: 1/5...  Training Step: 1631...  Training loss: 1.6919...  1.7756 sec/batch
Epoch: 1/5...  Training Step: 1632...  Training loss: 1.5909...  1.8031 sec/batch

Epoch: 1/5...  Training Step: 1721...  Training loss: 1.6256...  1.7820 sec/batch
Epoch: 1/5...  Training Step: 1722...  Training loss: 1.5947...  1.8204 sec/batch
Epoch: 1/5...  Training Step: 1723...  Training loss: 1.6883...  1.7718 sec/batch
Epoch: 1/5...  Training Step: 1724...  Training loss: 1.6281...  1.7676 sec/batch
Epoch: 1/5...  Training Step: 1725...  Training loss: 1.6426...  1.7938 sec/batch
Epoch: 1/5...  Training Step: 1726...  Training loss: 1.7026...  1.8139 sec/batch
Epoch: 1/5...  Training Step: 1727...  Training loss: 1.6080...  1.7467 sec/batch
Epoch: 1/5...  Training Step: 1728...  Training loss: 1.6306...  1.8289 sec/batch
Epoch: 1/5...  Training Step: 1729...  Training loss: 1.5941...  1.8823 sec/batch
Epoch: 1/5...  Training Step: 1730...  Training loss: 1.6174...  1.7845 sec/batch
Epoch: 1/5...  Training Step: 1731...  Training loss: 1.6389...  1.8017 sec/batch
Epoch: 1/5...  Training Step: 1732...  Training loss: 1.6958...  1.7636 sec/batch

Epoch: 1/5...  Training Step: 1821...  Training loss: 1.7743...  1.7798 sec/batch
Epoch: 1/5...  Training Step: 1822...  Training loss: 1.8048...  1.8089 sec/batch
Epoch: 1/5...  Training Step: 1823...  Training loss: 1.5676...  1.7758 sec/batch
Epoch: 1/5...  Training Step: 1824...  Training loss: 1.7695...  1.7267 sec/batch
Epoch: 1/5...  Training Step: 1825...  Training loss: 1.7000...  1.7754 sec/batch
Epoch: 1/5...  Training Step: 1826...  Training loss: 1.6870...  1.7819 sec/batch
Epoch: 1/5...  Training Step: 1827...  Training loss: 1.6511...  1.7764 sec/batch
Epoch: 1/5...  Training Step: 1828...  Training loss: 1.8922...  1.7802 sec/batch
Epoch: 1/5...  Training Step: 1829...  Training loss: 1.6555...  1.7856 sec/batch
Epoch: 1/5...  Training Step: 1830...  Training loss: 1.7155...  1.7667 sec/batch
Epoch: 1/5...  Training Step: 1831...  Training loss: 1.7186...  1.7854 sec/batch
Epoch: 1/5...  Training Step: 1832...  Training loss: 1.7380...  1.7879 sec/batch

Epoch: 1/5...  Training Step: 1921...  Training loss: 1.7791...  1.7456 sec/batch
Epoch: 1/5...  Training Step: 1922...  Training loss: 2.1515...  1.7370 sec/batch
Epoch: 1/5...  Training Step: 1923...  Training loss: 1.7399...  1.7894 sec/batch
Epoch: 1/5...  Training Step: 1924...  Training loss: 1.6374...  1.8166 sec/batch
Epoch: 1/5...  Training Step: 1925...  Training loss: 1.5372...  1.7614 sec/batch
Epoch: 1/5...  Training Step: 1926...  Training loss: 1.8185...  1.8029 sec/batch
Epoch: 1/5...  Training Step: 1927...  Training loss: 1.7189...  1.8377 sec/batch
Epoch: 1/5...  Training Step: 1928...  Training loss: 1.6933...  1.7515 sec/batch
Epoch: 1/5...  Training Step: 1929...  Training loss: 1.6938...  1.8041 sec/batch
Epoch: 1/5...  Training Step: 1930...  Training loss: 1.7261...  1.7778 sec/batch
Epoch: 1/5...  Training Step: 1931...  Training loss: 1.6498...  1.7578 sec/batch
Epoch: 1/5...  Training Step: 1932...  Training loss: 1.6748...  1.7703 sec/batch

Epoch: 2/5...  Training Step: 2021...  Training loss: 1.6454...  1.7944 sec/batch
Epoch: 2/5...  Training Step: 2022...  Training loss: 1.6266...  1.7716 sec/batch
Epoch: 2/5...  Training Step: 2023...  Training loss: 1.5336...  1.8167 sec/batch
Epoch: 2/5...  Training Step: 2024...  Training loss: 1.5210...  1.8982 sec/batch
Epoch: 2/5...  Training Step: 2025...  Training loss: 1.6039...  2.0103 sec/batch
Epoch: 2/5...  Training Step: 2026...  Training loss: 1.4982...  2.0383 sec/batch
Epoch: 2/5...  Training Step: 2027...  Training loss: 1.5785...  2.0304 sec/batch
Epoch: 2/5...  Training Step: 2028...  Training loss: 1.4786...  2.0115 sec/batch
Epoch: 2/5...  Training Step: 2029...  Training loss: 1.5713...  2.0570 sec/batch
Epoch: 2/5...  Training Step: 2030...  Training loss: 1.5648...  1.9655 sec/batch
Epoch: 2/5...  Training Step: 2031...  Training loss: 1.5614...  1.8416 sec/batch
Epoch: 2/5...  Training Step: 2032...  Training loss: 1.5966...  1.8142 sec/batch

Epoch: 2/5...  Training Step: 2121...  Training loss: 1.6179...  1.8250 sec/batch
Epoch: 2/5...  Training Step: 2122...  Training loss: 1.7859...  1.8049 sec/batch
Epoch: 2/5...  Training Step: 2123...  Training loss: 1.6279...  1.8199 sec/batch
Epoch: 2/5...  Training Step: 2124...  Training loss: 1.6143...  1.8732 sec/batch
Epoch: 2/5...  Training Step: 2125...  Training loss: 1.5629...  1.8047 sec/batch
Epoch: 2/5...  Training Step: 2126...  Training loss: 1.4898...  1.8044 sec/batch
Epoch: 2/5...  Training Step: 2127...  Training loss: 1.5164...  1.8045 sec/batch
Epoch: 2/5...  Training Step: 2128...  Training loss: 1.5120...  1.8197 sec/batch
Epoch: 2/5...  Training Step: 2129...  Training loss: 1.4894...  1.8352 sec/batch
Epoch: 2/5...  Training Step: 2130...  Training loss: 1.5292...  1.9478 sec/batch
Epoch: 2/5...  Training Step: 2131...  Training loss: 1.5659...  2.0370 sec/batch
Epoch: 2/5...  Training Step: 2132...  Training loss: 1.4517...  1.8412 sec/batch

Epoch: 2/5...  Training Step: 2221...  Training loss: 1.5375...  1.9122 sec/batch
Epoch: 2/5...  Training Step: 2222...  Training loss: 1.6259...  1.8813 sec/batch
Epoch: 2/5...  Training Step: 2223...  Training loss: 1.5732...  1.9880 sec/batch
Epoch: 2/5...  Training Step: 2224...  Training loss: 1.4788...  1.9459 sec/batch
Epoch: 2/5...  Training Step: 2225...  Training loss: 1.5842...  1.8941 sec/batch
Epoch: 2/5...  Training Step: 2226...  Training loss: 1.5195...  1.8066 sec/batch
Epoch: 2/5...  Training Step: 2227...  Training loss: 1.6045...  1.8038 sec/batch
Epoch: 2/5...  Training Step: 2228...  Training loss: 1.6677...  1.8050 sec/batch
Epoch: 2/5...  Training Step: 2229...  Training loss: 1.6834...  1.8201 sec/batch
Epoch: 2/5...  Training Step: 2230...  Training loss: 1.5986...  1.8516 sec/batch
Epoch: 2/5...  Training Step: 2231...  Training loss: 1.5898...  1.7890 sec/batch
Epoch: 2/5...  Training Step: 2232...  Training loss: 1.6827...  1.8578 sec/batch

Epoch: 2/5...  Training Step: 2321...  Training loss: 1.5717...  1.9154 sec/batch
Epoch: 2/5...  Training Step: 2322...  Training loss: 1.5683...  1.7914 sec/batch
Epoch: 2/5...  Training Step: 2323...  Training loss: 1.4993...  1.8726 sec/batch
Epoch: 2/5...  Training Step: 2324...  Training loss: 1.6682...  1.8457 sec/batch
Epoch: 2/5...  Training Step: 2325...  Training loss: 1.6276...  1.8088 sec/batch
Epoch: 2/5...  Training Step: 2326...  Training loss: 1.5190...  1.9354 sec/batch
Epoch: 2/5...  Training Step: 2327...  Training loss: 1.5647...  1.9906 sec/batch
Epoch: 2/5...  Training Step: 2328...  Training loss: 1.5774...  1.8269 sec/batch
Epoch: 2/5...  Training Step: 2329...  Training loss: 1.5521...  1.8750 sec/batch
Epoch: 2/5...  Training Step: 2330...  Training loss: 1.4764...  1.8673 sec/batch
Epoch: 2/5...  Training Step: 2331...  Training loss: 1.6802...  1.8031 sec/batch
Epoch: 2/5...  Training Step: 2332...  Training loss: 1.6239...  1.7872 sec/batch

Epoch: 2/5...  Training Step: 2421...  Training loss: 1.5190...  1.8895 sec/batch
Epoch: 2/5...  Training Step: 2422...  Training loss: 1.5893...  1.8667 sec/batch
Epoch: 2/5...  Training Step: 2423...  Training loss: 1.6384...  1.8829 sec/batch
Epoch: 2/5...  Training Step: 2424...  Training loss: 1.6475...  1.9200 sec/batch
Epoch: 2/5...  Training Step: 2425...  Training loss: 1.5286...  1.9370 sec/batch
Epoch: 2/5...  Training Step: 2426...  Training loss: 1.7688...  1.8782 sec/batch
Epoch: 2/5...  Training Step: 2427...  Training loss: 1.6114...  1.9709 sec/batch
Epoch: 2/5...  Training Step: 2428...  Training loss: 1.4998...  2.1001 sec/batch
Epoch: 2/5...  Training Step: 2429...  Training loss: 1.5083...  2.1136 sec/batch
Epoch: 2/5...  Training Step: 2430...  Training loss: 1.5486...  2.0746 sec/batch
Epoch: 2/5...  Training Step: 2431...  Training loss: 1.5836...  2.0838 sec/batch
Epoch: 2/5...  Training Step: 2432...  Training loss: 1.7230...  2.0764 sec/batch

Epoch: 2/5...  Training Step: 2521...  Training loss: 1.4565...  1.8714 sec/batch
Epoch: 2/5...  Training Step: 2522...  Training loss: 1.3880...  1.8837 sec/batch
Epoch: 2/5...  Training Step: 2523...  Training loss: 1.4255...  1.9029 sec/batch
Epoch: 2/5...  Training Step: 2524...  Training loss: 1.5023...  1.9618 sec/batch
Epoch: 2/5...  Training Step: 2525...  Training loss: 1.4379...  1.9216 sec/batch
Epoch: 2/5...  Training Step: 2526...  Training loss: 1.4748...  1.9380 sec/batch
Epoch: 2/5...  Training Step: 2527...  Training loss: 1.4254...  2.0101 sec/batch
Epoch: 2/5...  Training Step: 2528...  Training loss: 1.4403...  1.8850 sec/batch
Epoch: 2/5...  Training Step: 2529...  Training loss: 1.4441...  1.8989 sec/batch
Epoch: 2/5...  Training Step: 2530...  Training loss: 1.6268...  1.9121 sec/batch
Epoch: 2/5...  Training Step: 2531...  Training loss: 1.6489...  1.9060 sec/batch
Epoch: 2/5...  Training Step: 2532...  Training loss: 1.5040...  1.8933 sec/batch

Epoch: 2/5...  Training Step: 2621...  Training loss: 1.5388...  1.8940 sec/batch
Epoch: 2/5...  Training Step: 2622...  Training loss: 1.5014...  1.8772 sec/batch
Epoch: 2/5...  Training Step: 2623...  Training loss: 1.4720...  1.9278 sec/batch
Epoch: 2/5...  Training Step: 2624...  Training loss: 1.6209...  1.9133 sec/batch
Epoch: 2/5...  Training Step: 2625...  Training loss: 1.5103...  1.8990 sec/batch
Epoch: 2/5...  Training Step: 2626...  Training loss: 1.5875...  2.1390 sec/batch
Epoch: 2/5...  Training Step: 2627...  Training loss: 1.6033...  1.9306 sec/batch
Epoch: 2/5...  Training Step: 2628...  Training loss: 1.5580...  1.9611 sec/batch
Epoch: 2/5...  Training Step: 2629...  Training loss: 1.5051...  1.9083 sec/batch
Epoch: 2/5...  Training Step: 2630...  Training loss: 1.6456...  1.9449 sec/batch
Epoch: 2/5...  Training Step: 2631...  Training loss: 1.6304...  1.9111 sec/batch
Epoch: 2/5...  Training Step: 2632...  Training loss: 1.5558...  1.9310 sec/batch

Epoch: 2/5...  Training Step: 2721...  Training loss: 1.5712...  1.9285 sec/batch
Epoch: 2/5...  Training Step: 2722...  Training loss: 1.5894...  2.0320 sec/batch
Epoch: 2/5...  Training Step: 2723...  Training loss: 1.4397...  2.0661 sec/batch
Epoch: 2/5...  Training Step: 2724...  Training loss: 1.6356...  2.0136 sec/batch
Epoch: 2/5...  Training Step: 2725...  Training loss: 1.5397...  2.0194 sec/batch
Epoch: 2/5...  Training Step: 2726...  Training loss: 1.4896...  1.9863 sec/batch
Epoch: 2/5...  Training Step: 2727...  Training loss: 1.5451...  1.8712 sec/batch
Epoch: 2/5...  Training Step: 2728...  Training loss: 1.5892...  1.8920 sec/batch
Epoch: 2/5...  Training Step: 2729...  Training loss: 1.5469...  1.9109 sec/batch
Epoch: 2/5...  Training Step: 2730...  Training loss: 1.6244...  1.8669 sec/batch
Epoch: 2/5...  Training Step: 2731...  Training loss: 1.5109...  1.9215 sec/batch
Epoch: 2/5...  Training Step: 2732...  Training loss: 1.5774...  1.9741 sec/batch

Epoch: 2/5...  Training Step: 2821...  Training loss: 1.5290...  1.9228 sec/batch
Epoch: 2/5...  Training Step: 2822...  Training loss: 1.5471...  1.8649 sec/batch
Epoch: 2/5...  Training Step: 2823...  Training loss: 1.6062...  1.9026 sec/batch
Epoch: 2/5...  Training Step: 2824...  Training loss: 1.6067...  1.8157 sec/batch
Epoch: 2/5...  Training Step: 2825...  Training loss: 1.5203...  1.8856 sec/batch
Epoch: 2/5...  Training Step: 2826...  Training loss: 1.5896...  1.8885 sec/batch
Epoch: 2/5...  Training Step: 2827...  Training loss: 1.5490...  1.8813 sec/batch
Epoch: 2/5...  Training Step: 2828...  Training loss: 1.4825...  1.8503 sec/batch
Epoch: 2/5...  Training Step: 2829...  Training loss: 1.6170...  1.8999 sec/batch
Epoch: 2/5...  Training Step: 2830...  Training loss: 1.5146...  1.8888 sec/batch
Epoch: 2/5...  Training Step: 2831...  Training loss: 1.4925...  1.8587 sec/batch
Epoch: 2/5...  Training Step: 2832...  Training loss: 1.4463...  1.8979 sec/batch

Epoch: 2/5...  Training Step: 2921...  Training loss: 1.4019...  1.8838 sec/batch
Epoch: 2/5...  Training Step: 2922...  Training loss: 1.4223...  1.8618 sec/batch
Epoch: 2/5...  Training Step: 2923...  Training loss: 1.5464...  1.8685 sec/batch
Epoch: 2/5...  Training Step: 2924...  Training loss: 1.4852...  1.8943 sec/batch
Epoch: 2/5...  Training Step: 2925...  Training loss: 1.4964...  1.8925 sec/batch
Epoch: 2/5...  Training Step: 2926...  Training loss: 1.5233...  1.8884 sec/batch
Epoch: 2/5...  Training Step: 2927...  Training loss: 1.6181...  1.9179 sec/batch
Epoch: 2/5...  Training Step: 2928...  Training loss: 1.4253...  1.9374 sec/batch
Epoch: 2/5...  Training Step: 2929...  Training loss: 1.6563...  1.8754 sec/batch
Epoch: 2/5...  Training Step: 2930...  Training loss: 1.4386...  1.8748 sec/batch
Epoch: 2/5...  Training Step: 2931...  Training loss: 1.5964...  1.9280 sec/batch
Epoch: 2/5...  Training Step: 2932...  Training loss: 1.4765...  1.8637 sec/batch

Epoch: 2/5...  Training Step: 3021...  Training loss: 1.5680...  1.8981 sec/batch
Epoch: 2/5...  Training Step: 3022...  Training loss: 1.6029...  1.8811 sec/batch
Epoch: 2/5...  Training Step: 3023...  Training loss: 1.5146...  2.0411 sec/batch
Epoch: 2/5...  Training Step: 3024...  Training loss: 1.4841...  1.9049 sec/batch
Epoch: 2/5...  Training Step: 3025...  Training loss: 1.4802...  1.9993 sec/batch
Epoch: 2/5...  Training Step: 3026...  Training loss: 1.4647...  1.8508 sec/batch
Epoch: 2/5...  Training Step: 3027...  Training loss: 1.4819...  1.8808 sec/batch
Epoch: 2/5...  Training Step: 3028...  Training loss: 1.5096...  1.9142 sec/batch
Epoch: 2/5...  Training Step: 3029...  Training loss: 1.5949...  1.9127 sec/batch
Epoch: 2/5...  Training Step: 3030...  Training loss: 1.4895...  1.8835 sec/batch
Epoch: 2/5...  Training Step: 3031...  Training loss: 1.5105...  1.8778 sec/batch
Epoch: 2/5...  Training Step: 3032...  Training loss: 1.4324...  1.9444 sec/batch

Epoch: 2/5...  Training Step: 3121...  Training loss: 1.4469...  1.8819 sec/batch
Epoch: 2/5...  Training Step: 3122...  Training loss: 1.3491...  1.8431 sec/batch
Epoch: 2/5...  Training Step: 3123...  Training loss: 1.3594...  2.1960 sec/batch
Epoch: 2/5...  Training Step: 3124...  Training loss: 1.5084...  1.9875 sec/batch
Epoch: 2/5...  Training Step: 3125...  Training loss: 1.5056...  1.9806 sec/batch
Epoch: 2/5...  Training Step: 3126...  Training loss: 1.4175...  1.9576 sec/batch
Epoch: 2/5...  Training Step: 3127...  Training loss: 1.4847...  1.8307 sec/batch
Epoch: 2/5...  Training Step: 3128...  Training loss: 1.3824...  2.0460 sec/batch
Epoch: 2/5...  Training Step: 3129...  Training loss: 1.3716...  2.0421 sec/batch
Epoch: 2/5...  Training Step: 3130...  Training loss: 1.4922...  1.8942 sec/batch
Epoch: 2/5...  Training Step: 3131...  Training loss: 1.3759...  1.8673 sec/batch
Epoch: 2/5...  Training Step: 3132...  Training loss: 1.4079...  1.9548 sec/batch

Epoch: 2/5...  Training Step: 3221...  Training loss: 1.4651...  1.8076 sec/batch
Epoch: 2/5...  Training Step: 3222...  Training loss: 1.3675...  1.8529 sec/batch
Epoch: 2/5...  Training Step: 3223...  Training loss: 1.3723...  1.9121 sec/batch
Epoch: 2/5...  Training Step: 3224...  Training loss: 1.3680...  1.8520 sec/batch
Epoch: 2/5...  Training Step: 3225...  Training loss: 1.4444...  1.8745 sec/batch
Epoch: 2/5...  Training Step: 3226...  Training loss: 1.3617...  1.8804 sec/batch
Epoch: 2/5...  Training Step: 3227...  Training loss: 1.5012...  1.8951 sec/batch
Epoch: 2/5...  Training Step: 3228...  Training loss: 1.4456...  1.8549 sec/batch
Epoch: 2/5...  Training Step: 3229...  Training loss: 1.3685...  1.9066 sec/batch
Epoch: 2/5...  Training Step: 3230...  Training loss: 1.5192...  1.8698 sec/batch
Epoch: 2/5...  Training Step: 3231...  Training loss: 1.5005...  1.9870 sec/batch
Epoch: 2/5...  Training Step: 3232...  Training loss: 1.4415...  1.9920 sec/batch

Epoch: 2/5...  Training Step: 3321...  Training loss: 1.6309...  1.8086 sec/batch
Epoch: 2/5...  Training Step: 3322...  Training loss: 1.6005...  1.8079 sec/batch
Epoch: 2/5...  Training Step: 3323...  Training loss: 1.3922...  1.8426 sec/batch
Epoch: 2/5...  Training Step: 3324...  Training loss: 1.4633...  1.8017 sec/batch
Epoch: 2/5...  Training Step: 3325...  Training loss: 1.4324...  1.8124 sec/batch
Epoch: 2/5...  Training Step: 3326...  Training loss: 1.4486...  1.8098 sec/batch
Epoch: 2/5...  Training Step: 3327...  Training loss: 1.4340...  1.8680 sec/batch
Epoch: 2/5...  Training Step: 3328...  Training loss: 1.3867...  1.7883 sec/batch
Epoch: 2/5...  Training Step: 3329...  Training loss: 1.4242...  1.8187 sec/batch
Epoch: 2/5...  Training Step: 3330...  Training loss: 1.4904...  1.8504 sec/batch
Epoch: 2/5...  Training Step: 3331...  Training loss: 1.5127...  1.8646 sec/batch
Epoch: 2/5...  Training Step: 3332...  Training loss: 1.4502...  1.8669 sec/batch

Epoch: 2/5...  Training Step: 3421...  Training loss: 1.3505...  1.8281 sec/batch
Epoch: 2/5...  Training Step: 3422...  Training loss: 1.4454...  1.8472 sec/batch
Epoch: 2/5...  Training Step: 3423...  Training loss: 1.6045...  1.8276 sec/batch
Epoch: 2/5...  Training Step: 3424...  Training loss: 1.4731...  1.8272 sec/batch
Epoch: 2/5...  Training Step: 3425...  Training loss: 1.5557...  1.8280 sec/batch
Epoch: 2/5...  Training Step: 3426...  Training loss: 1.6204...  1.8374 sec/batch
Epoch: 2/5...  Training Step: 3427...  Training loss: 1.4791...  1.8730 sec/batch
Epoch: 2/5...  Training Step: 3428...  Training loss: 1.4068...  1.8656 sec/batch
Epoch: 2/5...  Training Step: 3429...  Training loss: 1.4674...  1.8474 sec/batch
Epoch: 2/5...  Training Step: 3430...  Training loss: 1.4387...  1.8419 sec/batch
Epoch: 2/5...  Training Step: 3431...  Training loss: 1.4596...  1.8820 sec/batch
Epoch: 2/5...  Training Step: 3432...  Training loss: 1.4880...  1.8294 sec/batch

Epoch: 2/5...  Training Step: 3521...  Training loss: 1.4585...  1.8191 sec/batch
Epoch: 2/5...  Training Step: 3522...  Training loss: 1.4082...  1.8344 sec/batch
Epoch: 2/5...  Training Step: 3523...  Training loss: 1.4376...  1.7901 sec/batch
Epoch: 2/5...  Training Step: 3524...  Training loss: 1.4760...  1.7890 sec/batch
Epoch: 2/5...  Training Step: 3525...  Training loss: 1.3889...  1.7950 sec/batch
Epoch: 2/5...  Training Step: 3526...  Training loss: 1.3775...  1.8196 sec/batch
Epoch: 2/5...  Training Step: 3527...  Training loss: 1.4236...  1.8589 sec/batch
Epoch: 2/5...  Training Step: 3528...  Training loss: 1.3745...  1.8202 sec/batch
Epoch: 2/5...  Training Step: 3529...  Training loss: 1.4274...  1.8074 sec/batch
Epoch: 2/5...  Training Step: 3530...  Training loss: 1.4302...  1.8314 sec/batch
Epoch: 2/5...  Training Step: 3531...  Training loss: 1.3218...  1.8279 sec/batch
Epoch: 2/5...  Training Step: 3532...  Training loss: 1.5470...  1.8459 sec/batch

Epoch: 2/5...  Training Step: 3621...  Training loss: 1.4452...  1.8204 sec/batch
Epoch: 2/5...  Training Step: 3622...  Training loss: 1.4282...  1.8169 sec/batch
Epoch: 2/5...  Training Step: 3623...  Training loss: 1.3714...  1.8028 sec/batch
Epoch: 2/5...  Training Step: 3624...  Training loss: 1.4488...  1.8247 sec/batch
Epoch: 2/5...  Training Step: 3625...  Training loss: 1.3972...  1.8478 sec/batch
Epoch: 2/5...  Training Step: 3626...  Training loss: 1.3922...  1.8418 sec/batch
Epoch: 2/5...  Training Step: 3627...  Training loss: 1.4428...  1.8475 sec/batch
Epoch: 2/5...  Training Step: 3628...  Training loss: 1.3429...  1.8234 sec/batch
Epoch: 2/5...  Training Step: 3629...  Training loss: 1.3631...  1.7958 sec/batch
Epoch: 2/5...  Training Step: 3630...  Training loss: 1.5177...  1.9415 sec/batch
Epoch: 2/5...  Training Step: 3631...  Training loss: 1.5290...  1.9322 sec/batch
Epoch: 2/5...  Training Step: 3632...  Training loss: 1.4761...  1.9010 sec/batch

Epoch: 2/5...  Training Step: 3721...  Training loss: 1.3902...  1.8314 sec/batch
Epoch: 2/5...  Training Step: 3722...  Training loss: 1.5020...  1.7998 sec/batch
Epoch: 2/5...  Training Step: 3723...  Training loss: 1.3439...  1.8343 sec/batch
Epoch: 2/5...  Training Step: 3724...  Training loss: 1.5355...  1.8257 sec/batch
Epoch: 2/5...  Training Step: 3725...  Training loss: 1.3573...  1.8504 sec/batch
Epoch: 2/5...  Training Step: 3726...  Training loss: 1.4541...  1.8949 sec/batch
Epoch: 2/5...  Training Step: 3727...  Training loss: 1.4791...  1.8396 sec/batch
Epoch: 2/5...  Training Step: 3728...  Training loss: 1.5258...  1.8546 sec/batch
Epoch: 2/5...  Training Step: 3729...  Training loss: 1.5651...  1.8487 sec/batch
Epoch: 2/5...  Training Step: 3730...  Training loss: 1.5247...  1.8306 sec/batch
Epoch: 2/5...  Training Step: 3731...  Training loss: 1.5931...  1.8410 sec/batch
Epoch: 2/5...  Training Step: 3732...  Training loss: 1.4284...  1.8554 sec/batch
Epoch: 2/5...  T

Epoch: 2/5...  Training Step: 3821...  Training loss: 1.4536...  1.7988 sec/batch
Epoch: 2/5...  Training Step: 3822...  Training loss: 1.5197...  1.8090 sec/batch
Epoch: 2/5...  Training Step: 3823...  Training loss: 1.5201...  1.8249 sec/batch
Epoch: 2/5...  Training Step: 3824...  Training loss: 1.5146...  1.8073 sec/batch
Epoch: 2/5...  Training Step: 3825...  Training loss: 1.6188...  1.8799 sec/batch
Epoch: 2/5...  Training Step: 3826...  Training loss: 1.4954...  1.8551 sec/batch
Epoch: 2/5...  Training Step: 3827...  Training loss: 1.3308...  1.8071 sec/batch
Epoch: 2/5...  Training Step: 3828...  Training loss: 1.5217...  1.8220 sec/batch
Epoch: 2/5...  Training Step: 3829...  Training loss: 1.4656...  1.7914 sec/batch
Epoch: 2/5...  Training Step: 3830...  Training loss: 1.4099...  1.8379 sec/batch
Epoch: 2/5...  Training Step: 3831...  Training loss: 1.4861...  1.8826 sec/batch
Epoch: 2/5...  Training Step: 3832...  Training loss: 1.5601...  1.8799 sec/batch
Epoch: 2/5...  T

Epoch: 2/5...  Training Step: 3921...  Training loss: 1.4904...  1.7992 sec/batch
Epoch: 2/5...  Training Step: 3922...  Training loss: 1.4772...  1.8051 sec/batch
Epoch: 2/5...  Training Step: 3923...  Training loss: 1.4398...  1.7928 sec/batch
Epoch: 2/5...  Training Step: 3924...  Training loss: 1.4633...  1.8060 sec/batch
Epoch: 2/5...  Training Step: 3925...  Training loss: 1.4409...  1.7920 sec/batch
Epoch: 2/5...  Training Step: 3926...  Training loss: 1.5174...  1.8172 sec/batch
Epoch: 2/5...  Training Step: 3927...  Training loss: 1.4548...  1.8250 sec/batch
Epoch: 2/5...  Training Step: 3928...  Training loss: 1.4465...  1.7841 sec/batch
Epoch: 2/5...  Training Step: 3929...  Training loss: 1.5517...  1.8100 sec/batch
Epoch: 2/5...  Training Step: 3930...  Training loss: 1.4149...  1.8163 sec/batch
Epoch: 2/5...  Training Step: 3931...  Training loss: 1.4781...  1.7825 sec/batch
Epoch: 2/5...  Training Step: 3932...  Training loss: 1.4843...  1.8137 sec/batch
Epoch: 2/5...  T

Epoch: 3/5...  Training Step: 4021...  Training loss: 1.4650...  1.7779 sec/batch
Epoch: 3/5...  Training Step: 4022...  Training loss: 1.4153...  1.8029 sec/batch
Epoch: 3/5...  Training Step: 4023...  Training loss: 1.4151...  1.8320 sec/batch
Epoch: 3/5...  Training Step: 4024...  Training loss: 1.3244...  1.8486 sec/batch
Epoch: 3/5...  Training Step: 4025...  Training loss: 1.3952...  1.8192 sec/batch
Epoch: 3/5...  Training Step: 4026...  Training loss: 1.3751...  1.8537 sec/batch
Epoch: 3/5...  Training Step: 4027...  Training loss: 1.4589...  1.9084 sec/batch
Epoch: 3/5...  Training Step: 4028...  Training loss: 1.3690...  1.8642 sec/batch
Epoch: 3/5...  Training Step: 4029...  Training loss: 1.3173...  1.8712 sec/batch
Epoch: 3/5...  Training Step: 4030...  Training loss: 1.3380...  1.8055 sec/batch
Epoch: 3/5...  Training Step: 4031...  Training loss: 1.3412...  1.8083 sec/batch
Epoch: 3/5...  Training Step: 4032...  Training loss: 1.3198...  1.8327 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 4121...  Training loss: 1.4320...  1.8250 sec/batch
Epoch: 3/5...  Training Step: 4122...  Training loss: 1.4286...  1.8176 sec/batch
Epoch: 3/5...  Training Step: 4123...  Training loss: 1.3812...  1.8311 sec/batch
Epoch: 3/5...  Training Step: 4124...  Training loss: 1.5169...  1.8126 sec/batch
Epoch: 3/5...  Training Step: 4125...  Training loss: 1.5357...  1.8638 sec/batch
Epoch: 3/5...  Training Step: 4126...  Training loss: 1.4740...  1.8251 sec/batch
Epoch: 3/5...  Training Step: 4127...  Training loss: 1.4356...  1.8318 sec/batch
Epoch: 3/5...  Training Step: 4128...  Training loss: 1.5329...  1.8478 sec/batch
Epoch: 3/5...  Training Step: 4129...  Training loss: 1.3963...  1.8300 sec/batch
Epoch: 3/5...  Training Step: 4130...  Training loss: 1.5210...  1.8169 sec/batch
Epoch: 3/5...  Training Step: 4131...  Training loss: 1.3735...  1.8407 sec/batch
Epoch: 3/5...  Training Step: 4132...  Training loss: 1.4491...  1.8106 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 4221...  Training loss: 1.3959...  1.8038 sec/batch
Epoch: 3/5...  Training Step: 4222...  Training loss: 1.3608...  1.7819 sec/batch
Epoch: 3/5...  Training Step: 4223...  Training loss: 1.3783...  1.7957 sec/batch
Epoch: 3/5...  Training Step: 4224...  Training loss: 1.5115...  1.7769 sec/batch
Epoch: 3/5...  Training Step: 4225...  Training loss: 1.5543...  1.8094 sec/batch
Epoch: 3/5...  Training Step: 4226...  Training loss: 1.5345...  1.7774 sec/batch
Epoch: 3/5...  Training Step: 4227...  Training loss: 1.4322...  1.8170 sec/batch
Epoch: 3/5...  Training Step: 4228...  Training loss: 1.5394...  1.7978 sec/batch
Epoch: 3/5...  Training Step: 4229...  Training loss: 1.4959...  1.7920 sec/batch
Epoch: 3/5...  Training Step: 4230...  Training loss: 1.4005...  1.8076 sec/batch
Epoch: 3/5...  Training Step: 4231...  Training loss: 1.4597...  1.8109 sec/batch
Epoch: 3/5...  Training Step: 4232...  Training loss: 1.4207...  1.7840 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 4321...  Training loss: 1.4540...  1.9539 sec/batch
Epoch: 3/5...  Training Step: 4322...  Training loss: 1.4393...  1.9484 sec/batch
Epoch: 3/5...  Training Step: 4323...  Training loss: 1.4221...  1.9292 sec/batch
Epoch: 3/5...  Training Step: 4324...  Training loss: 1.3824...  1.8930 sec/batch
Epoch: 3/5...  Training Step: 4325...  Training loss: 1.4484...  1.8903 sec/batch
Epoch: 3/5...  Training Step: 4326...  Training loss: 1.4219...  1.8702 sec/batch
Epoch: 3/5...  Training Step: 4327...  Training loss: 1.4959...  1.8460 sec/batch
Epoch: 3/5...  Training Step: 4328...  Training loss: 1.3476...  1.8509 sec/batch
Epoch: 3/5...  Training Step: 4329...  Training loss: 1.3644...  1.8478 sec/batch
Epoch: 3/5...  Training Step: 4330...  Training loss: 1.3697...  1.8571 sec/batch
Epoch: 3/5...  Training Step: 4331...  Training loss: 1.3834...  1.8508 sec/batch
Epoch: 3/5...  Training Step: 4332...  Training loss: 1.4233...  1.8920 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 4421...  Training loss: 1.3000...  1.8781 sec/batch
Epoch: 3/5...  Training Step: 4422...  Training loss: 1.3680...  1.8614 sec/batch
Epoch: 3/5...  Training Step: 4423...  Training loss: 1.4897...  1.9638 sec/batch
Epoch: 3/5...  Training Step: 4424...  Training loss: 1.5388...  1.9527 sec/batch
Epoch: 3/5...  Training Step: 4425...  Training loss: 1.5283...  1.8374 sec/batch
Epoch: 3/5...  Training Step: 4426...  Training loss: 1.4781...  1.8893 sec/batch
Epoch: 3/5...  Training Step: 4427...  Training loss: 1.5228...  1.8629 sec/batch
Epoch: 3/5...  Training Step: 4428...  Training loss: 1.4673...  1.9377 sec/batch
Epoch: 3/5...  Training Step: 4429...  Training loss: 1.3657...  1.8880 sec/batch
Epoch: 3/5...  Training Step: 4430...  Training loss: 1.3665...  1.9391 sec/batch
Epoch: 3/5...  Training Step: 4431...  Training loss: 1.3805...  1.8702 sec/batch
Epoch: 3/5...  Training Step: 4432...  Training loss: 1.4036...  1.8970 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 4521...  Training loss: 1.3096...  1.7910 sec/batch
Epoch: 3/5...  Training Step: 4522...  Training loss: 1.3460...  1.7858 sec/batch
Epoch: 3/5...  Training Step: 4523...  Training loss: 1.3637...  1.7660 sec/batch
Epoch: 3/5...  Training Step: 4524...  Training loss: 1.4582...  1.7901 sec/batch
Epoch: 3/5...  Training Step: 4525...  Training loss: 1.3048...  1.7978 sec/batch
Epoch: 3/5...  Training Step: 4526...  Training loss: 1.4115...  1.7992 sec/batch
Epoch: 3/5...  Training Step: 4527...  Training loss: 1.3517...  1.8288 sec/batch
Epoch: 3/5...  Training Step: 4528...  Training loss: 1.4826...  1.7478 sec/batch
Epoch: 3/5...  Training Step: 4529...  Training loss: 1.3508...  1.8461 sec/batch
Epoch: 3/5...  Training Step: 4530...  Training loss: 1.2314...  1.8245 sec/batch
Epoch: 3/5...  Training Step: 4531...  Training loss: 1.4514...  1.7845 sec/batch
Epoch: 3/5...  Training Step: 4532...  Training loss: 1.3328...  1.7863 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 4621...  Training loss: 1.4472...  1.8773 sec/batch
Epoch: 3/5...  Training Step: 4622...  Training loss: 1.4339...  1.8416 sec/batch
Epoch: 3/5...  Training Step: 4623...  Training loss: 1.4274...  1.8230 sec/batch
Epoch: 3/5...  Training Step: 4624...  Training loss: 1.5584...  1.9037 sec/batch
Epoch: 3/5...  Training Step: 4625...  Training loss: 1.4717...  1.8977 sec/batch
Epoch: 3/5...  Training Step: 4626...  Training loss: 1.4359...  1.8844 sec/batch
Epoch: 3/5...  Training Step: 4627...  Training loss: 1.4509...  1.8635 sec/batch
Epoch: 3/5...  Training Step: 4628...  Training loss: 1.4956...  1.9275 sec/batch
Epoch: 3/5...  Training Step: 4629...  Training loss: 1.4123...  1.8743 sec/batch
Epoch: 3/5...  Training Step: 4630...  Training loss: 1.3681...  1.8089 sec/batch
Epoch: 3/5...  Training Step: 4631...  Training loss: 1.4514...  1.8215 sec/batch
Epoch: 3/5...  Training Step: 4632...  Training loss: 1.4159...  1.8455 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 4721...  Training loss: 1.3956...  2.0261 sec/batch
Epoch: 3/5...  Training Step: 4722...  Training loss: 1.5950...  1.9570 sec/batch
Epoch: 3/5...  Training Step: 4723...  Training loss: 1.2524...  1.8521 sec/batch
Epoch: 3/5...  Training Step: 4724...  Training loss: 1.4295...  1.8489 sec/batch
Epoch: 3/5...  Training Step: 4725...  Training loss: 1.3458...  1.8337 sec/batch
Epoch: 3/5...  Training Step: 4726...  Training loss: 1.3111...  1.8699 sec/batch
Epoch: 3/5...  Training Step: 4727...  Training loss: 1.3599...  1.8900 sec/batch
Epoch: 3/5...  Training Step: 4728...  Training loss: 1.4327...  1.8633 sec/batch
Epoch: 3/5...  Training Step: 4729...  Training loss: 1.4533...  1.8040 sec/batch
Epoch: 3/5...  Training Step: 4730...  Training loss: 1.4318...  1.8462 sec/batch
Epoch: 3/5...  Training Step: 4731...  Training loss: 1.4515...  1.8604 sec/batch
Epoch: 3/5...  Training Step: 4732...  Training loss: 1.4724...  1.7931 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 4821...  Training loss: 1.3858...  1.8494 sec/batch
Epoch: 3/5...  Training Step: 4822...  Training loss: 1.4171...  1.8066 sec/batch
Epoch: 3/5...  Training Step: 4823...  Training loss: 1.4213...  1.8359 sec/batch
Epoch: 3/5...  Training Step: 4824...  Training loss: 1.3973...  1.8510 sec/batch
Epoch: 3/5...  Training Step: 4825...  Training loss: 1.4604...  1.8440 sec/batch
Epoch: 3/5...  Training Step: 4826...  Training loss: 1.3890...  1.9014 sec/batch
Epoch: 3/5...  Training Step: 4827...  Training loss: 1.4102...  1.8645 sec/batch
Epoch: 3/5...  Training Step: 4828...  Training loss: 1.4785...  1.8636 sec/batch
Epoch: 3/5...  Training Step: 4829...  Training loss: 1.3128...  1.8896 sec/batch
Epoch: 3/5...  Training Step: 4830...  Training loss: 1.3996...  1.8478 sec/batch
Epoch: 3/5...  Training Step: 4831...  Training loss: 1.4228...  1.8579 sec/batch
Epoch: 3/5...  Training Step: 4832...  Training loss: 1.3422...  1.8699 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 4921...  Training loss: 1.4349...  1.8263 sec/batch
Epoch: 3/5...  Training Step: 4922...  Training loss: 1.2540...  1.8335 sec/batch
Epoch: 3/5...  Training Step: 4923...  Training loss: 1.3736...  1.8429 sec/batch
Epoch: 3/5...  Training Step: 4924...  Training loss: 1.4157...  1.8439 sec/batch
Epoch: 3/5...  Training Step: 4925...  Training loss: 1.3340...  1.7678 sec/batch
Epoch: 3/5...  Training Step: 4926...  Training loss: 1.3678...  1.7888 sec/batch
Epoch: 3/5...  Training Step: 4927...  Training loss: 1.4297...  1.9014 sec/batch
Epoch: 3/5...  Training Step: 4928...  Training loss: 1.3733...  1.8429 sec/batch
Epoch: 3/5...  Training Step: 4929...  Training loss: 1.4159...  1.8271 sec/batch
Epoch: 3/5...  Training Step: 4930...  Training loss: 1.3659...  1.9078 sec/batch
Epoch: 3/5...  Training Step: 4931...  Training loss: 1.3441...  1.8567 sec/batch
Epoch: 3/5...  Training Step: 4932...  Training loss: 1.3276...  1.8220 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5021...  Training loss: 1.3127...  1.8107 sec/batch
Epoch: 3/5...  Training Step: 5022...  Training loss: 1.4639...  1.7431 sec/batch
Epoch: 3/5...  Training Step: 5023...  Training loss: 1.4553...  1.8575 sec/batch
Epoch: 3/5...  Training Step: 5024...  Training loss: 1.4117...  1.9372 sec/batch
Epoch: 3/5...  Training Step: 5025...  Training loss: 1.3875...  1.8939 sec/batch
Epoch: 3/5...  Training Step: 5026...  Training loss: 1.3522...  1.8760 sec/batch
Epoch: 3/5...  Training Step: 5027...  Training loss: 1.3634...  1.8759 sec/batch
Epoch: 3/5...  Training Step: 5028...  Training loss: 1.3393...  1.8873 sec/batch
Epoch: 3/5...  Training Step: 5029...  Training loss: 1.3911...  1.9030 sec/batch
Epoch: 3/5...  Training Step: 5030...  Training loss: 1.2333...  1.7201 sec/batch
Epoch: 3/5...  Training Step: 5031...  Training loss: 1.5015...  1.7960 sec/batch
Epoch: 3/5...  Training Step: 5032...  Training loss: 1.3697...  1.8135 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5121...  Training loss: 1.4798...  1.8524 sec/batch
Epoch: 3/5...  Training Step: 5122...  Training loss: 1.3545...  1.7896 sec/batch
Epoch: 3/5...  Training Step: 5123...  Training loss: 1.4169...  1.8323 sec/batch
Epoch: 3/5...  Training Step: 5124...  Training loss: 1.4409...  1.7763 sec/batch
Epoch: 3/5...  Training Step: 5125...  Training loss: 1.3689...  1.7852 sec/batch
Epoch: 3/5...  Training Step: 5126...  Training loss: 1.3260...  1.8124 sec/batch
Epoch: 3/5...  Training Step: 5127...  Training loss: 1.4201...  1.7926 sec/batch
Epoch: 3/5...  Training Step: 5128...  Training loss: 1.3361...  1.8302 sec/batch
Epoch: 3/5...  Training Step: 5129...  Training loss: 1.3475...  1.7836 sec/batch
Epoch: 3/5...  Training Step: 5130...  Training loss: 1.3712...  1.7514 sec/batch
Epoch: 3/5...  Training Step: 5131...  Training loss: 1.2814...  1.8060 sec/batch
Epoch: 3/5...  Training Step: 5132...  Training loss: 1.3398...  1.8035 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5221...  Training loss: 1.3015...  1.8566 sec/batch
Epoch: 3/5...  Training Step: 5222...  Training loss: 1.3012...  1.7669 sec/batch
Epoch: 3/5...  Training Step: 5223...  Training loss: 1.3893...  1.8263 sec/batch
Epoch: 3/5...  Training Step: 5224...  Training loss: 1.3907...  1.8313 sec/batch
Epoch: 3/5...  Training Step: 5225...  Training loss: 1.4064...  1.7860 sec/batch
Epoch: 3/5...  Training Step: 5226...  Training loss: 1.2844...  1.8107 sec/batch
Epoch: 3/5...  Training Step: 5227...  Training loss: 1.3709...  1.7926 sec/batch
Epoch: 3/5...  Training Step: 5228...  Training loss: 1.3262...  1.8654 sec/batch
Epoch: 3/5...  Training Step: 5229...  Training loss: 1.4043...  1.9975 sec/batch
Epoch: 3/5...  Training Step: 5230...  Training loss: 1.4460...  1.8361 sec/batch
Epoch: 3/5...  Training Step: 5231...  Training loss: 1.2930...  1.9099 sec/batch
Epoch: 3/5...  Training Step: 5232...  Training loss: 1.3641...  2.0028 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5321...  Training loss: 1.4247...  2.0420 sec/batch
Epoch: 3/5...  Training Step: 5322...  Training loss: 1.5122...  2.0535 sec/batch
Epoch: 3/5...  Training Step: 5323...  Training loss: 1.4555...  2.0595 sec/batch
Epoch: 3/5...  Training Step: 5324...  Training loss: 1.3379...  2.0019 sec/batch
Epoch: 3/5...  Training Step: 5325...  Training loss: 1.2936...  1.9699 sec/batch
Epoch: 3/5...  Training Step: 5326...  Training loss: 1.3245...  1.9524 sec/batch
Epoch: 3/5...  Training Step: 5327...  Training loss: 1.3365...  1.9289 sec/batch
Epoch: 3/5...  Training Step: 5328...  Training loss: 1.3475...  1.9434 sec/batch
Epoch: 3/5...  Training Step: 5329...  Training loss: 1.2883...  1.9884 sec/batch
Epoch: 3/5...  Training Step: 5330...  Training loss: 1.3461...  2.0020 sec/batch
Epoch: 3/5...  Training Step: 5331...  Training loss: 1.3416...  1.9449 sec/batch
Epoch: 3/5...  Training Step: 5332...  Training loss: 1.3370...  1.9489 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5421...  Training loss: 1.2959...  2.0124 sec/batch
Epoch: 3/5...  Training Step: 5422...  Training loss: 1.3642...  2.0129 sec/batch
Epoch: 3/5...  Training Step: 5423...  Training loss: 1.3403...  1.9859 sec/batch
Epoch: 3/5...  Training Step: 5424...  Training loss: 1.4305...  1.9984 sec/batch
Epoch: 3/5...  Training Step: 5425...  Training loss: 1.2984...  1.9284 sec/batch
Epoch: 3/5...  Training Step: 5426...  Training loss: 1.3478...  1.9454 sec/batch
Epoch: 3/5...  Training Step: 5427...  Training loss: 1.3854...  2.0265 sec/batch
Epoch: 3/5...  Training Step: 5428...  Training loss: 1.4666...  1.9849 sec/batch
Epoch: 3/5...  Training Step: 5429...  Training loss: 1.4697...  1.9584 sec/batch
Epoch: 3/5...  Training Step: 5430...  Training loss: 1.4879...  1.9594 sec/batch
Epoch: 3/5...  Training Step: 5431...  Training loss: 1.3690...  1.9444 sec/batch
Epoch: 3/5...  Training Step: 5432...  Training loss: 1.4628...  1.9494 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5521...  Training loss: 1.3344...  1.9048 sec/batch
Epoch: 3/5...  Training Step: 5522...  Training loss: 1.2147...  1.9148 sec/batch
Epoch: 3/5...  Training Step: 5523...  Training loss: 1.3374...  1.9644 sec/batch
Epoch: 3/5...  Training Step: 5524...  Training loss: 1.3260...  1.9959 sec/batch
Epoch: 3/5...  Training Step: 5525...  Training loss: 1.2920...  1.9444 sec/batch
Epoch: 3/5...  Training Step: 5526...  Training loss: 1.3661...  1.9494 sec/batch
Epoch: 3/5...  Training Step: 5527...  Training loss: 1.3125...  1.9268 sec/batch
Epoch: 3/5...  Training Step: 5528...  Training loss: 1.3041...  1.9233 sec/batch
Epoch: 3/5...  Training Step: 5529...  Training loss: 1.2882...  1.9414 sec/batch
Epoch: 3/5...  Training Step: 5530...  Training loss: 1.2499...  1.9193 sec/batch
Epoch: 3/5...  Training Step: 5531...  Training loss: 1.3233...  1.9198 sec/batch
Epoch: 3/5...  Training Step: 5532...  Training loss: 1.3949...  2.0304 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5621...  Training loss: 1.4679...  2.0139 sec/batch
Epoch: 3/5...  Training Step: 5622...  Training loss: 1.4014...  2.0480 sec/batch
Epoch: 3/5...  Training Step: 5623...  Training loss: 1.4161...  2.0375 sec/batch
Epoch: 3/5...  Training Step: 5624...  Training loss: 1.3591...  2.0299 sec/batch
Epoch: 3/5...  Training Step: 5625...  Training loss: 1.4120...  2.0074 sec/batch
Epoch: 3/5...  Training Step: 5626...  Training loss: 1.3892...  2.0865 sec/batch
Epoch: 3/5...  Training Step: 5627...  Training loss: 1.4155...  2.0009 sec/batch
Epoch: 3/5...  Training Step: 5628...  Training loss: 1.3954...  2.0205 sec/batch
Epoch: 3/5...  Training Step: 5629...  Training loss: 1.3749...  2.0210 sec/batch
Epoch: 3/5...  Training Step: 5630...  Training loss: 1.3529...  2.0095 sec/batch
Epoch: 3/5...  Training Step: 5631...  Training loss: 1.3237...  1.9724 sec/batch
Epoch: 3/5...  Training Step: 5632...  Training loss: 1.3471...  1.9359 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5721...  Training loss: 1.2991...  1.8739 sec/batch
Epoch: 3/5...  Training Step: 5722...  Training loss: 1.3182...  1.8376 sec/batch
Epoch: 3/5...  Training Step: 5723...  Training loss: 1.3648...  1.8600 sec/batch
Epoch: 3/5...  Training Step: 5724...  Training loss: 1.3497...  1.8101 sec/batch
Epoch: 3/5...  Training Step: 5725...  Training loss: 1.3628...  1.8368 sec/batch
Epoch: 3/5...  Training Step: 5726...  Training loss: 1.3484...  1.8110 sec/batch
Epoch: 3/5...  Training Step: 5727...  Training loss: 1.3172...  1.8531 sec/batch
Epoch: 3/5...  Training Step: 5728...  Training loss: 1.2874...  1.8490 sec/batch
Epoch: 3/5...  Training Step: 5729...  Training loss: 1.3373...  1.8209 sec/batch
Epoch: 3/5...  Training Step: 5730...  Training loss: 1.3624...  1.7836 sec/batch
Epoch: 3/5...  Training Step: 5731...  Training loss: 1.4033...  1.8431 sec/batch
Epoch: 3/5...  Training Step: 5732...  Training loss: 1.3461...  1.8288 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5821...  Training loss: 1.3889...  1.8174 sec/batch
Epoch: 3/5...  Training Step: 5822...  Training loss: 1.4101...  1.7759 sec/batch
Epoch: 3/5...  Training Step: 5823...  Training loss: 1.3480...  1.8119 sec/batch
Epoch: 3/5...  Training Step: 5824...  Training loss: 1.3729...  1.8102 sec/batch
Epoch: 3/5...  Training Step: 5825...  Training loss: 1.3634...  1.8039 sec/batch
Epoch: 3/5...  Training Step: 5826...  Training loss: 1.4291...  1.7899 sec/batch
Epoch: 3/5...  Training Step: 5827...  Training loss: 1.4086...  1.8194 sec/batch
Epoch: 3/5...  Training Step: 5828...  Training loss: 1.5092...  1.8541 sec/batch
Epoch: 3/5...  Training Step: 5829...  Training loss: 1.3800...  1.7987 sec/batch
Epoch: 3/5...  Training Step: 5830...  Training loss: 1.3405...  1.8546 sec/batch
Epoch: 3/5...  Training Step: 5831...  Training loss: 1.3190...  1.8289 sec/batch
Epoch: 3/5...  Training Step: 5832...  Training loss: 1.5404...  1.8229 sec/batch
Epoch: 3/5...  T

Epoch: 3/5...  Training Step: 5921...  Training loss: 1.5736...  1.8309 sec/batch
Epoch: 3/5...  Training Step: 5922...  Training loss: 1.5695...  1.8090 sec/batch
Epoch: 3/5...  Training Step: 5923...  Training loss: 1.6113...  1.8251 sec/batch
Epoch: 3/5...  Training Step: 5924...  Training loss: 1.4355...  1.7835 sec/batch
Epoch: 3/5...  Training Step: 5925...  Training loss: 1.5760...  1.8316 sec/batch
Epoch: 3/5...  Training Step: 5926...  Training loss: 1.5280...  1.8881 sec/batch
Epoch: 3/5...  Training Step: 5927...  Training loss: 1.3529...  1.8059 sec/batch
Epoch: 3/5...  Training Step: 5928...  Training loss: 1.3148...  1.8401 sec/batch
Epoch: 3/5...  Training Step: 5929...  Training loss: 1.4088...  1.8189 sec/batch
Epoch: 3/5...  Training Step: 5930...  Training loss: 1.5196...  1.8138 sec/batch
Epoch: 3/5...  Training Step: 5931...  Training loss: 1.4818...  1.8107 sec/batch
Epoch: 3/5...  Training Step: 5932...  Training loss: 1.4079...  1.7856 sec/batch
Epoch: 3/5...  T

Epoch: 4/5...  Training Step: 6021...  Training loss: 1.3709...  1.8339 sec/batch
Epoch: 4/5...  Training Step: 6022...  Training loss: 1.3293...  1.8412 sec/batch
Epoch: 4/5...  Training Step: 6023...  Training loss: 1.2880...  1.9824 sec/batch
Epoch: 4/5...  Training Step: 6024...  Training loss: 1.3754...  1.9804 sec/batch
Epoch: 4/5...  Training Step: 6025...  Training loss: 1.3368...  1.9544 sec/batch
Epoch: 4/5...  Training Step: 6026...  Training loss: 1.4050...  1.9579 sec/batch
Epoch: 4/5...  Training Step: 6027...  Training loss: 1.3906...  2.0104 sec/batch
Epoch: 4/5...  Training Step: 6028...  Training loss: 1.3104...  1.9108 sec/batch
Epoch: 4/5...  Training Step: 6029...  Training loss: 1.3534...  1.9524 sec/batch
Epoch: 4/5...  Training Step: 6030...  Training loss: 1.2171...  1.9013 sec/batch
Epoch: 4/5...  Training Step: 6031...  Training loss: 1.3320...  1.9779 sec/batch
Epoch: 4/5...  Training Step: 6032...  Training loss: 1.2420...  2.0029 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 6121...  Training loss: 1.4502...  1.9434 sec/batch
Epoch: 4/5...  Training Step: 6122...  Training loss: 1.4098...  1.9534 sec/batch
Epoch: 4/5...  Training Step: 6123...  Training loss: 1.4602...  1.9419 sec/batch
Epoch: 4/5...  Training Step: 6124...  Training loss: 1.3551...  1.9449 sec/batch
Epoch: 4/5...  Training Step: 6125...  Training loss: 1.4235...  1.9254 sec/batch
Epoch: 4/5...  Training Step: 6126...  Training loss: 1.6435...  1.9429 sec/batch
Epoch: 4/5...  Training Step: 6127...  Training loss: 1.4434...  1.9299 sec/batch
Epoch: 4/5...  Training Step: 6128...  Training loss: 1.4017...  1.9324 sec/batch
Epoch: 4/5...  Training Step: 6129...  Training loss: 1.4781...  1.8858 sec/batch
Epoch: 4/5...  Training Step: 6130...  Training loss: 1.3720...  1.8728 sec/batch
Epoch: 4/5...  Training Step: 6131...  Training loss: 1.2753...  1.9168 sec/batch
Epoch: 4/5...  Training Step: 6132...  Training loss: 1.3443...  1.9218 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 6221...  Training loss: 1.4207...  2.0084 sec/batch
Epoch: 4/5...  Training Step: 6222...  Training loss: 1.2923...  1.9574 sec/batch
Epoch: 4/5...  Training Step: 6223...  Training loss: 1.3747...  1.9033 sec/batch
Epoch: 4/5...  Training Step: 6224...  Training loss: 1.2528...  1.9208 sec/batch
Epoch: 4/5...  Training Step: 6225...  Training loss: 1.1986...  1.9459 sec/batch
Epoch: 4/5...  Training Step: 6226...  Training loss: 1.2548...  1.9830 sec/batch
Epoch: 4/5...  Training Step: 6227...  Training loss: 1.3486...  1.9889 sec/batch
Epoch: 4/5...  Training Step: 6228...  Training loss: 1.3296...  1.9699 sec/batch
Epoch: 4/5...  Training Step: 6229...  Training loss: 1.3127...  1.9053 sec/batch
Epoch: 4/5...  Training Step: 6230...  Training loss: 1.2787...  1.9659 sec/batch
Epoch: 4/5...  Training Step: 6231...  Training loss: 1.2907...  1.9195 sec/batch
Epoch: 4/5...  Training Step: 6232...  Training loss: 1.2680...  1.9183 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 6321...  Training loss: 1.3160...  1.9810 sec/batch
Epoch: 4/5...  Training Step: 6322...  Training loss: 1.3275...  1.9699 sec/batch
Epoch: 4/5...  Training Step: 6323...  Training loss: 1.2772...  1.9149 sec/batch
Epoch: 4/5...  Training Step: 6324...  Training loss: 1.2805...  1.9103 sec/batch
Epoch: 4/5...  Training Step: 6325...  Training loss: 1.3416...  1.9188 sec/batch
Epoch: 4/5...  Training Step: 6326...  Training loss: 1.2710...  1.9409 sec/batch
Epoch: 4/5...  Training Step: 6327...  Training loss: 1.2997...  1.9003 sec/batch
Epoch: 4/5...  Training Step: 6328...  Training loss: 1.3591...  1.8833 sec/batch
Epoch: 4/5...  Training Step: 6329...  Training loss: 1.3613...  1.8933 sec/batch
Epoch: 4/5...  Training Step: 6330...  Training loss: 1.2945...  1.9439 sec/batch
Epoch: 4/5...  Training Step: 6331...  Training loss: 1.3910...  1.9594 sec/batch
Epoch: 4/5...  Training Step: 6332...  Training loss: 1.2936...  1.9139 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 6421...  Training loss: 1.2771...  1.9789 sec/batch
Epoch: 4/5...  Training Step: 6422...  Training loss: 1.3141...  1.9744 sec/batch
Epoch: 4/5...  Training Step: 6423...  Training loss: 1.2892...  2.0285 sec/batch
Epoch: 4/5...  Training Step: 6424...  Training loss: 1.2575...  1.9729 sec/batch
Epoch: 4/5...  Training Step: 6425...  Training loss: 1.2896...  1.9634 sec/batch
Epoch: 4/5...  Training Step: 6426...  Training loss: 1.2359...  1.9869 sec/batch
Epoch: 4/5...  Training Step: 6427...  Training loss: 1.3322...  2.0575 sec/batch
Epoch: 4/5...  Training Step: 6428...  Training loss: 1.3710...  2.0640 sec/batch
Epoch: 4/5...  Training Step: 6429...  Training loss: 1.2572...  2.0530 sec/batch
Epoch: 4/5...  Training Step: 6430...  Training loss: 1.3212...  2.0325 sec/batch
Epoch: 4/5...  Training Step: 6431...  Training loss: 1.3289...  1.9989 sec/batch
Epoch: 4/5...  Training Step: 6432...  Training loss: 1.3959...  1.9849 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 6521...  Training loss: 1.2902...  2.0129 sec/batch
Epoch: 4/5...  Training Step: 6522...  Training loss: 1.3537...  1.9584 sec/batch
Epoch: 4/5...  Training Step: 6523...  Training loss: 1.2826...  1.9639 sec/batch
Epoch: 4/5...  Training Step: 6524...  Training loss: 1.3575...  1.9614 sec/batch
Epoch: 4/5...  Training Step: 6525...  Training loss: 1.4439...  2.0009 sec/batch
Epoch: 4/5...  Training Step: 6526...  Training loss: 1.3163...  1.9954 sec/batch
Epoch: 4/5...  Training Step: 6527...  Training loss: 1.4435...  1.9108 sec/batch
Epoch: 4/5...  Training Step: 6528...  Training loss: 1.3671...  1.9324 sec/batch
Epoch: 4/5...  Training Step: 6529...  Training loss: 1.3906...  1.9384 sec/batch
Epoch: 4/5...  Training Step: 6530...  Training loss: 1.3763...  1.8843 sec/batch
Epoch: 4/5...  Training Step: 6531...  Training loss: 1.3895...  1.9248 sec/batch
Epoch: 4/5...  Training Step: 6532...  Training loss: 1.3738...  1.9379 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 6621...  Training loss: 1.4228...  1.9263 sec/batch
Epoch: 4/5...  Training Step: 6622...  Training loss: 1.4479...  2.0410 sec/batch
Epoch: 4/5...  Training Step: 6623...  Training loss: 1.2973...  1.9549 sec/batch
Epoch: 4/5...  Training Step: 6624...  Training loss: 1.3122...  2.0149 sec/batch
Epoch: 4/5...  Training Step: 6625...  Training loss: 1.3503...  1.9629 sec/batch
Epoch: 4/5...  Training Step: 6626...  Training loss: 1.1827...  1.9704 sec/batch
Epoch: 4/5...  Training Step: 6627...  Training loss: 1.3939...  1.9614 sec/batch
Epoch: 4/5...  Training Step: 6628...  Training loss: 1.3567...  2.0024 sec/batch
Epoch: 4/5...  Training Step: 6629...  Training loss: 1.2580...  1.9619 sec/batch
Epoch: 4/5...  Training Step: 6630...  Training loss: 1.3033...  1.9409 sec/batch
Epoch: 4/5...  Training Step: 6631...  Training loss: 1.2753...  1.9939 sec/batch
Epoch: 4/5...  Training Step: 6632...  Training loss: 1.2920...  1.9724 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 6721...  Training loss: 1.4221...  1.9419 sec/batch
Epoch: 4/5...  Training Step: 6722...  Training loss: 1.3240...  1.9504 sec/batch
Epoch: 4/5...  Training Step: 6723...  Training loss: 1.3916...  1.9454 sec/batch
Epoch: 4/5...  Training Step: 6724...  Training loss: 1.3109...  1.9354 sec/batch
Epoch: 4/5...  Training Step: 6725...  Training loss: 1.4854...  1.9379 sec/batch
Epoch: 4/5...  Training Step: 6726...  Training loss: 1.3794...  1.9664 sec/batch
Epoch: 4/5...  Training Step: 6727...  Training loss: 1.4591...  2.0154 sec/batch
Epoch: 4/5...  Training Step: 6728...  Training loss: 1.2929...  2.0039 sec/batch
Epoch: 4/5...  Training Step: 6729...  Training loss: 1.2433...  1.9739 sec/batch
Epoch: 4/5...  Training Step: 6730...  Training loss: 1.1821...  1.9649 sec/batch
Epoch: 4/5...  Training Step: 6731...  Training loss: 1.2843...  1.9314 sec/batch
Epoch: 4/5...  Training Step: 6732...  Training loss: 1.3876...  1.9297 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 6821...  Training loss: 1.4195...  1.9559 sec/batch
Epoch: 4/5...  Training Step: 6822...  Training loss: 1.4781...  1.9514 sec/batch
Epoch: 4/5...  Training Step: 6823...  Training loss: 1.3697...  1.9064 sec/batch
Epoch: 4/5...  Training Step: 6824...  Training loss: 1.3045...  1.9173 sec/batch
Epoch: 4/5...  Training Step: 6825...  Training loss: 1.3547...  1.9704 sec/batch
Epoch: 4/5...  Training Step: 6826...  Training loss: 1.2915...  1.9314 sec/batch
Epoch: 4/5...  Training Step: 6827...  Training loss: 1.3279...  1.9359 sec/batch
Epoch: 4/5...  Training Step: 6828...  Training loss: 1.2169...  1.9574 sec/batch
Epoch: 4/5...  Training Step: 6829...  Training loss: 1.3740...  1.9934 sec/batch
Epoch: 4/5...  Training Step: 6830...  Training loss: 1.2860...  1.9979 sec/batch
Epoch: 4/5...  Training Step: 6831...  Training loss: 1.3156...  1.9999 sec/batch
Epoch: 4/5...  Training Step: 6832...  Training loss: 1.3097...  1.9644 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 6921...  Training loss: 1.2535...  1.9349 sec/batch
Epoch: 4/5...  Training Step: 6922...  Training loss: 1.4359...  1.9354 sec/batch
Epoch: 4/5...  Training Step: 6923...  Training loss: 1.2787...  1.9209 sec/batch
Epoch: 4/5...  Training Step: 6924...  Training loss: 1.5962...  1.8808 sec/batch
Epoch: 4/5...  Training Step: 6925...  Training loss: 1.5801...  1.9253 sec/batch
Epoch: 4/5...  Training Step: 6926...  Training loss: 1.2410...  1.9504 sec/batch
Epoch: 4/5...  Training Step: 6927...  Training loss: 1.2871...  2.0094 sec/batch
Epoch: 4/5...  Training Step: 6928...  Training loss: 1.3198...  2.1260 sec/batch
Epoch: 4/5...  Training Step: 6929...  Training loss: 1.2766...  1.9649 sec/batch
Epoch: 4/5...  Training Step: 6930...  Training loss: 1.3003...  1.9884 sec/batch
Epoch: 4/5...  Training Step: 6931...  Training loss: 1.3913...  1.9509 sec/batch
Epoch: 4/5...  Training Step: 6932...  Training loss: 1.2952...  1.9474 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7021...  Training loss: 1.2705...  1.9654 sec/batch
Epoch: 4/5...  Training Step: 7022...  Training loss: 1.3159...  1.9594 sec/batch
Epoch: 4/5...  Training Step: 7023...  Training loss: 1.3042...  1.9124 sec/batch
Epoch: 4/5...  Training Step: 7024...  Training loss: 1.3543...  1.9073 sec/batch
Epoch: 4/5...  Training Step: 7025...  Training loss: 1.3031...  1.8803 sec/batch
Epoch: 4/5...  Training Step: 7026...  Training loss: 1.3265...  1.8558 sec/batch
Epoch: 4/5...  Training Step: 7027...  Training loss: 1.3127...  1.8728 sec/batch
Epoch: 4/5...  Training Step: 7028...  Training loss: 1.3667...  1.8748 sec/batch
Epoch: 4/5...  Training Step: 7029...  Training loss: 1.3951...  1.8748 sec/batch
Epoch: 4/5...  Training Step: 7030...  Training loss: 1.3052...  1.8928 sec/batch
Epoch: 4/5...  Training Step: 7031...  Training loss: 1.4279...  1.9238 sec/batch
Epoch: 4/5...  Training Step: 7032...  Training loss: 1.3277...  2.0325 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7121...  Training loss: 1.3086...  1.9494 sec/batch
Epoch: 4/5...  Training Step: 7122...  Training loss: 1.2083...  1.9489 sec/batch
Epoch: 4/5...  Training Step: 7123...  Training loss: 1.3359...  1.9404 sec/batch
Epoch: 4/5...  Training Step: 7124...  Training loss: 1.4635...  2.0044 sec/batch
Epoch: 4/5...  Training Step: 7125...  Training loss: 1.3873...  1.9364 sec/batch
Epoch: 4/5...  Training Step: 7126...  Training loss: 1.3042...  2.0360 sec/batch
Epoch: 4/5...  Training Step: 7127...  Training loss: 1.3184...  1.9268 sec/batch
Epoch: 4/5...  Training Step: 7128...  Training loss: 1.2488...  1.9354 sec/batch
Epoch: 4/5...  Training Step: 7129...  Training loss: 1.2993...  1.9349 sec/batch
Epoch: 4/5...  Training Step: 7130...  Training loss: 1.3922...  1.9098 sec/batch
Epoch: 4/5...  Training Step: 7131...  Training loss: 1.4730...  1.9233 sec/batch
Epoch: 4/5...  Training Step: 7132...  Training loss: 1.3787...  1.9409 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7221...  Training loss: 1.2972...  1.9469 sec/batch
Epoch: 4/5...  Training Step: 7222...  Training loss: 1.2902...  1.9359 sec/batch
Epoch: 4/5...  Training Step: 7223...  Training loss: 1.3257...  1.9494 sec/batch
Epoch: 4/5...  Training Step: 7224...  Training loss: 1.3245...  1.9514 sec/batch
Epoch: 4/5...  Training Step: 7225...  Training loss: 1.4038...  2.0255 sec/batch
Epoch: 4/5...  Training Step: 7226...  Training loss: 1.2929...  2.0284 sec/batch
Epoch: 4/5...  Training Step: 7227...  Training loss: 1.2545...  1.9369 sec/batch
Epoch: 4/5...  Training Step: 7228...  Training loss: 1.3160...  1.9504 sec/batch
Epoch: 4/5...  Training Step: 7229...  Training loss: 1.3193...  1.9849 sec/batch
Epoch: 4/5...  Training Step: 7230...  Training loss: 1.3255...  1.9704 sec/batch
Epoch: 4/5...  Training Step: 7231...  Training loss: 1.3232...  1.9769 sec/batch
Epoch: 4/5...  Training Step: 7232...  Training loss: 1.1401...  1.9664 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7321...  Training loss: 1.3141...  1.9614 sec/batch
Epoch: 4/5...  Training Step: 7322...  Training loss: 1.3594...  1.9444 sec/batch
Epoch: 4/5...  Training Step: 7323...  Training loss: 1.3811...  1.9309 sec/batch
Epoch: 4/5...  Training Step: 7324...  Training loss: 1.3099...  1.9494 sec/batch
Epoch: 4/5...  Training Step: 7325...  Training loss: 1.3341...  1.9389 sec/batch
Epoch: 4/5...  Training Step: 7326...  Training loss: 1.3281...  1.9268 sec/batch
Epoch: 4/5...  Training Step: 7327...  Training loss: 1.4006...  1.9243 sec/batch
Epoch: 4/5...  Training Step: 7328...  Training loss: 1.3753...  2.0114 sec/batch
Epoch: 4/5...  Training Step: 7329...  Training loss: 1.3100...  1.9554 sec/batch
Epoch: 4/5...  Training Step: 7330...  Training loss: 1.3355...  1.9714 sec/batch
Epoch: 4/5...  Training Step: 7331...  Training loss: 1.2979...  1.9954 sec/batch
Epoch: 4/5...  Training Step: 7332...  Training loss: 1.2289...  1.9899 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7421...  Training loss: 1.2643...  2.0144 sec/batch
Epoch: 4/5...  Training Step: 7422...  Training loss: 1.3066...  2.0235 sec/batch
Epoch: 4/5...  Training Step: 7423...  Training loss: 1.3137...  2.0375 sec/batch
Epoch: 4/5...  Training Step: 7424...  Training loss: 1.2986...  2.0210 sec/batch
Epoch: 4/5...  Training Step: 7425...  Training loss: 1.3050...  2.0234 sec/batch
Epoch: 4/5...  Training Step: 7426...  Training loss: 1.4449...  1.9554 sec/batch
Epoch: 4/5...  Training Step: 7427...  Training loss: 1.2597...  1.9649 sec/batch
Epoch: 4/5...  Training Step: 7428...  Training loss: 1.2945...  1.9404 sec/batch
Epoch: 4/5...  Training Step: 7429...  Training loss: 1.1868...  1.9504 sec/batch
Epoch: 4/5...  Training Step: 7430...  Training loss: 1.1308...  1.9654 sec/batch
Epoch: 4/5...  Training Step: 7431...  Training loss: 1.1999...  1.9619 sec/batch
Epoch: 4/5...  Training Step: 7432...  Training loss: 1.1902...  1.9914 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7521...  Training loss: 1.3084...  2.0224 sec/batch
Epoch: 4/5...  Training Step: 7522...  Training loss: 1.3397...  1.9844 sec/batch
Epoch: 4/5...  Training Step: 7523...  Training loss: 1.3792...  2.0400 sec/batch
Epoch: 4/5...  Training Step: 7524...  Training loss: 1.2936...  2.0430 sec/batch
Epoch: 4/5...  Training Step: 7525...  Training loss: 1.3599...  1.9879 sec/batch
Epoch: 4/5...  Training Step: 7526...  Training loss: 1.2385...  1.9634 sec/batch
Epoch: 4/5...  Training Step: 7527...  Training loss: 1.2696...  2.0665 sec/batch
Epoch: 4/5...  Training Step: 7528...  Training loss: 1.2753...  1.9779 sec/batch
Epoch: 4/5...  Training Step: 7529...  Training loss: 1.3105...  1.9839 sec/batch
Epoch: 4/5...  Training Step: 7530...  Training loss: 1.2420...  2.0249 sec/batch
Epoch: 4/5...  Training Step: 7531...  Training loss: 1.2784...  1.9744 sec/batch
Epoch: 4/5...  Training Step: 7532...  Training loss: 1.2211...  1.9218 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7621...  Training loss: 1.3091...  1.9754 sec/batch
Epoch: 4/5...  Training Step: 7622...  Training loss: 1.3373...  1.9464 sec/batch
Epoch: 4/5...  Training Step: 7623...  Training loss: 1.4582...  1.9564 sec/batch
Epoch: 4/5...  Training Step: 7624...  Training loss: 1.4221...  2.0350 sec/batch
Epoch: 4/5...  Training Step: 7625...  Training loss: 1.3227...  2.0372 sec/batch
Epoch: 4/5...  Training Step: 7626...  Training loss: 1.3478...  1.9849 sec/batch
Epoch: 4/5...  Training Step: 7627...  Training loss: 1.3439...  2.0324 sec/batch
Epoch: 4/5...  Training Step: 7628...  Training loss: 1.3891...  2.0385 sec/batch
Epoch: 4/5...  Training Step: 7629...  Training loss: 1.4253...  1.9654 sec/batch
Epoch: 4/5...  Training Step: 7630...  Training loss: 1.3319...  1.9724 sec/batch
Epoch: 4/5...  Training Step: 7631...  Training loss: 1.3081...  1.9484 sec/batch
Epoch: 4/5...  Training Step: 7632...  Training loss: 1.2510...  1.9909 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7721...  Training loss: 1.3028...  1.9539 sec/batch
Epoch: 4/5...  Training Step: 7722...  Training loss: 1.3517...  1.9304 sec/batch
Epoch: 4/5...  Training Step: 7723...  Training loss: 1.2875...  1.8988 sec/batch
Epoch: 4/5...  Training Step: 7724...  Training loss: 1.3546...  1.9204 sec/batch
Epoch: 4/5...  Training Step: 7725...  Training loss: 1.3598...  1.9572 sec/batch
Epoch: 4/5...  Training Step: 7726...  Training loss: 1.3335...  1.9113 sec/batch
Epoch: 4/5...  Training Step: 7727...  Training loss: 1.2561...  1.9178 sec/batch
Epoch: 4/5...  Training Step: 7728...  Training loss: 1.2239...  1.8943 sec/batch
Epoch: 4/5...  Training Step: 7729...  Training loss: 1.2452...  1.9063 sec/batch
Epoch: 4/5...  Training Step: 7730...  Training loss: 1.3141...  1.8873 sec/batch
Epoch: 4/5...  Training Step: 7731...  Training loss: 1.2874...  1.8728 sec/batch
Epoch: 4/5...  Training Step: 7732...  Training loss: 1.3780...  1.9289 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7821...  Training loss: 1.3900...  2.0454 sec/batch
Epoch: 4/5...  Training Step: 7822...  Training loss: 1.2764...  2.0049 sec/batch
Epoch: 4/5...  Training Step: 7823...  Training loss: 1.4288...  1.9805 sec/batch
Epoch: 4/5...  Training Step: 7824...  Training loss: 1.4320...  1.9084 sec/batch
Epoch: 4/5...  Training Step: 7825...  Training loss: 1.3625...  1.8899 sec/batch
Epoch: 4/5...  Training Step: 7826...  Training loss: 1.4003...  1.9072 sec/batch
Epoch: 4/5...  Training Step: 7827...  Training loss: 1.4581...  1.8963 sec/batch
Epoch: 4/5...  Training Step: 7828...  Training loss: 1.3733...  1.9664 sec/batch
Epoch: 4/5...  Training Step: 7829...  Training loss: 1.4594...  1.9914 sec/batch
Epoch: 4/5...  Training Step: 7830...  Training loss: 1.3420...  1.9609 sec/batch
Epoch: 4/5...  Training Step: 7831...  Training loss: 1.3529...  1.9223 sec/batch
Epoch: 4/5...  Training Step: 7832...  Training loss: 1.3125...  1.9290 sec/batch
Epoch: 4/5...  T

Epoch: 4/5...  Training Step: 7921...  Training loss: 1.4063...  1.9182 sec/batch
Epoch: 4/5...  Training Step: 7922...  Training loss: 1.3764...  1.9309 sec/batch
Epoch: 4/5...  Training Step: 7923...  Training loss: 1.2533...  2.0134 sec/batch
Epoch: 4/5...  Training Step: 7924...  Training loss: 1.3543...  1.9253 sec/batch
Epoch: 4/5...  Training Step: 7925...  Training loss: 1.4643...  1.9342 sec/batch
Epoch: 4/5...  Training Step: 7926...  Training loss: 1.4310...  1.9482 sec/batch
Epoch: 4/5...  Training Step: 7927...  Training loss: 1.3095...  1.9842 sec/batch
Epoch: 4/5...  Training Step: 7928...  Training loss: 1.4241...  1.9878 sec/batch
Epoch: 4/5...  Training Step: 7929...  Training loss: 1.3458...  1.9984 sec/batch
Epoch: 4/5...  Training Step: 7930...  Training loss: 1.3083...  1.9815 sec/batch
Epoch: 4/5...  Training Step: 7931...  Training loss: 1.2910...  1.9298 sec/batch
Epoch: 4/5...  Training Step: 7932...  Training loss: 1.2984...  1.9164 sec/batch
Epoch: 4/5...  T

Epoch: 5/5...  Training Step: 8021...  Training loss: 1.3155...  1.9389 sec/batch
Epoch: 5/5...  Training Step: 8022...  Training loss: 1.3028...  1.9133 sec/batch
Epoch: 5/5...  Training Step: 8023...  Training loss: 1.3228...  1.9439 sec/batch
Epoch: 5/5...  Training Step: 8024...  Training loss: 1.3844...  1.9559 sec/batch
Epoch: 5/5...  Training Step: 8025...  Training loss: 1.2871...  1.9909 sec/batch
Epoch: 5/5...  Training Step: 8026...  Training loss: 1.3049...  1.9288 sec/batch
Epoch: 5/5...  Training Step: 8027...  Training loss: 1.3177...  1.9954 sec/batch
Epoch: 5/5...  Training Step: 8028...  Training loss: 1.2868...  1.9689 sec/batch
Epoch: 5/5...  Training Step: 8029...  Training loss: 1.2414...  1.9434 sec/batch
Epoch: 5/5...  Training Step: 8030...  Training loss: 1.2253...  1.9409 sec/batch
Epoch: 5/5...  Training Step: 8031...  Training loss: 1.3143...  1.9319 sec/batch
Epoch: 5/5...  Training Step: 8032...  Training loss: 1.2354...  1.9329 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 8121...  Training loss: 1.3979...  2.0925 sec/batch
Epoch: 5/5...  Training Step: 8122...  Training loss: 1.3165...  2.1421 sec/batch
Epoch: 5/5...  Training Step: 8123...  Training loss: 1.3324...  2.0975 sec/batch
Epoch: 5/5...  Training Step: 8124...  Training loss: 1.4015...  2.0044 sec/batch
Epoch: 5/5...  Training Step: 8125...  Training loss: 1.3461...  2.0750 sec/batch
Epoch: 5/5...  Training Step: 8126...  Training loss: 1.3119...  2.0480 sec/batch
Epoch: 5/5...  Training Step: 8127...  Training loss: 1.2425...  1.9664 sec/batch
Epoch: 5/5...  Training Step: 8128...  Training loss: 1.3634...  1.8783 sec/batch
Epoch: 5/5...  Training Step: 8129...  Training loss: 1.4249...  2.0084 sec/batch
Epoch: 5/5...  Training Step: 8130...  Training loss: 1.3377...  2.0035 sec/batch
Epoch: 5/5...  Training Step: 8131...  Training loss: 1.3865...  1.9769 sec/batch
Epoch: 5/5...  Training Step: 8132...  Training loss: 1.4687...  1.9854 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 8221...  Training loss: 1.3344...  1.9178 sec/batch
Epoch: 5/5...  Training Step: 8222...  Training loss: 1.1976...  1.9211 sec/batch
Epoch: 5/5...  Training Step: 8223...  Training loss: 1.1747...  1.9113 sec/batch
Epoch: 5/5...  Training Step: 8224...  Training loss: 1.2476...  1.8999 sec/batch
Epoch: 5/5...  Training Step: 8225...  Training loss: 1.3710...  1.9729 sec/batch
Epoch: 5/5...  Training Step: 8226...  Training loss: 1.3015...  2.0202 sec/batch
Epoch: 5/5...  Training Step: 8227...  Training loss: 1.2184...  1.9243 sec/batch
Epoch: 5/5...  Training Step: 8228...  Training loss: 1.2622...  1.9395 sec/batch
Epoch: 5/5...  Training Step: 8229...  Training loss: 1.3319...  1.9406 sec/batch
Epoch: 5/5...  Training Step: 8230...  Training loss: 1.3704...  1.9585 sec/batch
Epoch: 5/5...  Training Step: 8231...  Training loss: 1.3179...  1.9352 sec/batch
Epoch: 5/5...  Training Step: 8232...  Training loss: 1.3937...  1.9549 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 8321...  Training loss: 1.2562...  1.8928 sec/batch
Epoch: 5/5...  Training Step: 8322...  Training loss: 1.2872...  1.9088 sec/batch
Epoch: 5/5...  Training Step: 8323...  Training loss: 1.2801...  1.8983 sec/batch
Epoch: 5/5...  Training Step: 8324...  Training loss: 1.3088...  1.9529 sec/batch
Epoch: 5/5...  Training Step: 8325...  Training loss: 1.3436...  1.8928 sec/batch
Epoch: 5/5...  Training Step: 8326...  Training loss: 1.2460...  1.9434 sec/batch
Epoch: 5/5...  Training Step: 8327...  Training loss: 1.3986...  1.9569 sec/batch
Epoch: 5/5...  Training Step: 8328...  Training loss: 1.3470...  1.9208 sec/batch
Epoch: 5/5...  Training Step: 8329...  Training loss: 1.3968...  1.9374 sec/batch
Epoch: 5/5...  Training Step: 8330...  Training loss: 1.3571...  1.9088 sec/batch
Epoch: 5/5...  Training Step: 8331...  Training loss: 1.2947...  1.9359 sec/batch
Epoch: 5/5...  Training Step: 8332...  Training loss: 1.2337...  1.9399 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 8421...  Training loss: 1.2251...  1.9684 sec/batch
Epoch: 5/5...  Training Step: 8422...  Training loss: 1.2018...  2.0177 sec/batch
Epoch: 5/5...  Training Step: 8423...  Training loss: 1.3390...  1.9584 sec/batch
Epoch: 5/5...  Training Step: 8424...  Training loss: 1.2931...  2.0804 sec/batch
Epoch: 5/5...  Training Step: 8425...  Training loss: 1.2425...  1.9529 sec/batch
Epoch: 5/5...  Training Step: 8426...  Training loss: 1.2713...  1.9684 sec/batch
Epoch: 5/5...  Training Step: 8427...  Training loss: 1.2488...  1.9974 sec/batch
Epoch: 5/5...  Training Step: 8428...  Training loss: 1.3253...  1.9914 sec/batch
Epoch: 5/5...  Training Step: 8429...  Training loss: 1.3254...  1.9294 sec/batch
Epoch: 5/5...  Training Step: 8430...  Training loss: 1.2215...  2.0425 sec/batch
Epoch: 5/5...  Training Step: 8431...  Training loss: 1.2428...  2.0615 sec/batch
Epoch: 5/5...  Training Step: 8432...  Training loss: 1.3694...  2.0064 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 8521...  Training loss: 1.3060...  2.0099 sec/batch
Epoch: 5/5...  Training Step: 8522...  Training loss: 1.3254...  2.0394 sec/batch
Epoch: 5/5...  Training Step: 8523...  Training loss: 1.3600...  2.0670 sec/batch
Epoch: 5/5...  Training Step: 8524...  Training loss: 1.2693...  2.0019 sec/batch
Epoch: 5/5...  Training Step: 8525...  Training loss: 1.2490...  2.0044 sec/batch
Epoch: 5/5...  Training Step: 8526...  Training loss: 1.1816...  2.1861 sec/batch
Epoch: 5/5...  Training Step: 8527...  Training loss: 1.2064...  2.0299 sec/batch
Epoch: 5/5...  Training Step: 8528...  Training loss: 1.2847...  1.9989 sec/batch
Epoch: 5/5...  Training Step: 8529...  Training loss: 1.2055...  2.0590 sec/batch
Epoch: 5/5...  Training Step: 8530...  Training loss: 1.2045...  1.9167 sec/batch
Epoch: 5/5...  Training Step: 8531...  Training loss: 1.3609...  1.8943 sec/batch
Epoch: 5/5...  Training Step: 8532...  Training loss: 1.3173...  1.9228 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 8621...  Training loss: 1.3240...  1.8853 sec/batch
Epoch: 5/5...  Training Step: 8622...  Training loss: 1.2114...  1.9038 sec/batch
Epoch: 5/5...  Training Step: 8623...  Training loss: 1.2446...  1.9003 sec/batch
Epoch: 5/5...  Training Step: 8624...  Training loss: 1.2130...  1.8773 sec/batch
Epoch: 5/5...  Training Step: 8625...  Training loss: 1.4193...  1.9227 sec/batch
Epoch: 5/5...  Training Step: 8626...  Training loss: 1.2549...  1.9464 sec/batch
Epoch: 5/5...  Training Step: 8627...  Training loss: 1.2869...  1.9494 sec/batch
Epoch: 5/5...  Training Step: 8628...  Training loss: 1.2817...  1.9919 sec/batch
Epoch: 5/5...  Training Step: 8629...  Training loss: 1.3054...  1.9989 sec/batch
Epoch: 5/5...  Training Step: 8630...  Training loss: 1.1985...  1.9572 sec/batch
Epoch: 5/5...  Training Step: 8631...  Training loss: 1.3923...  2.0145 sec/batch
Epoch: 5/5...  Training Step: 8632...  Training loss: 1.4164...  2.0515 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 8721...  Training loss: 1.2929...  1.9705 sec/batch
Epoch: 5/5...  Training Step: 8722...  Training loss: 1.3125...  2.0870 sec/batch
Epoch: 5/5...  Training Step: 8723...  Training loss: 1.3975...  2.0139 sec/batch
Epoch: 5/5...  Training Step: 8724...  Training loss: 1.2608...  1.9974 sec/batch
Epoch: 5/5...  Training Step: 8725...  Training loss: 1.2338...  1.9584 sec/batch
Epoch: 5/5...  Training Step: 8726...  Training loss: 1.2749...  1.9654 sec/batch
Epoch: 5/5...  Training Step: 8727...  Training loss: 1.3530...  1.9324 sec/batch
Epoch: 5/5...  Training Step: 8728...  Training loss: 1.3419...  2.0114 sec/batch
Epoch: 5/5...  Training Step: 8729...  Training loss: 1.3998...  2.0134 sec/batch
Epoch: 5/5...  Training Step: 8730...  Training loss: 1.3630...  1.9469 sec/batch
Epoch: 5/5...  Training Step: 8731...  Training loss: 1.4463...  1.9984 sec/batch
Epoch: 5/5...  Training Step: 8732...  Training loss: 1.2916...  1.9879 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 8821...  Training loss: 1.3400...  2.0485 sec/batch
Epoch: 5/5...  Training Step: 8822...  Training loss: 1.2951...  2.0685 sec/batch
Epoch: 5/5...  Training Step: 8823...  Training loss: 1.3729...  2.0635 sec/batch
Epoch: 5/5...  Training Step: 8824...  Training loss: 1.2380...  2.0530 sec/batch
Epoch: 5/5...  Training Step: 8825...  Training loss: 1.3613...  2.0860 sec/batch
Epoch: 5/5...  Training Step: 8826...  Training loss: 1.2218...  2.1026 sec/batch
Epoch: 5/5...  Training Step: 8827...  Training loss: 1.3234...  2.0605 sec/batch
Epoch: 5/5...  Training Step: 8828...  Training loss: 1.2461...  2.1015 sec/batch
Epoch: 5/5...  Training Step: 8829...  Training loss: 1.2596...  2.0730 sec/batch
Epoch: 5/5...  Training Step: 8830...  Training loss: 1.2124...  2.0645 sec/batch
Epoch: 5/5...  Training Step: 8831...  Training loss: 1.2510...  2.0360 sec/batch
Epoch: 5/5...  Training Step: 8832...  Training loss: 1.2603...  1.9914 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 8921...  Training loss: 1.3051...  1.9970 sec/batch
Epoch: 5/5...  Training Step: 8922...  Training loss: 1.1839...  1.9899 sec/batch
Epoch: 5/5...  Training Step: 8923...  Training loss: 1.2002...  2.0064 sec/batch
Epoch: 5/5...  Training Step: 8924...  Training loss: 1.3313...  2.0244 sec/batch
Epoch: 5/5...  Training Step: 8925...  Training loss: 1.3990...  1.9509 sec/batch
Epoch: 5/5...  Training Step: 8926...  Training loss: 1.2762...  1.9404 sec/batch
Epoch: 5/5...  Training Step: 8927...  Training loss: 1.3439...  1.9534 sec/batch
Epoch: 5/5...  Training Step: 8928...  Training loss: 1.2454...  1.9714 sec/batch
Epoch: 5/5...  Training Step: 8929...  Training loss: 1.2453...  2.0390 sec/batch
Epoch: 5/5...  Training Step: 8930...  Training loss: 1.3628...  1.9709 sec/batch
Epoch: 5/5...  Training Step: 8931...  Training loss: 1.2642...  2.0324 sec/batch
Epoch: 5/5...  Training Step: 8932...  Training loss: 1.3824...  1.9674 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9021...  Training loss: 1.2299...  1.9889 sec/batch
Epoch: 5/5...  Training Step: 9022...  Training loss: 1.2393...  1.9989 sec/batch
Epoch: 5/5...  Training Step: 9023...  Training loss: 1.2994...  1.9509 sec/batch
Epoch: 5/5...  Training Step: 9024...  Training loss: 1.2521...  1.9304 sec/batch
Epoch: 5/5...  Training Step: 9025...  Training loss: 1.2744...  1.9319 sec/batch
Epoch: 5/5...  Training Step: 9026...  Training loss: 1.2862...  1.9844 sec/batch
Epoch: 5/5...  Training Step: 9027...  Training loss: 1.3140...  1.9794 sec/batch
Epoch: 5/5...  Training Step: 9028...  Training loss: 1.3897...  2.0174 sec/batch
Epoch: 5/5...  Training Step: 9029...  Training loss: 1.2883...  1.9774 sec/batch
Epoch: 5/5...  Training Step: 9030...  Training loss: 1.2706...  1.9644 sec/batch
Epoch: 5/5...  Training Step: 9031...  Training loss: 1.3385...  1.8673 sec/batch
Epoch: 5/5...  Training Step: 9032...  Training loss: 1.2237...  1.9033 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9121...  Training loss: 1.3988...  1.9414 sec/batch
Epoch: 5/5...  Training Step: 9122...  Training loss: 1.3375...  1.9945 sec/batch
Epoch: 5/5...  Training Step: 9123...  Training loss: 1.2111...  1.9645 sec/batch
Epoch: 5/5...  Training Step: 9124...  Training loss: 1.1819...  1.9434 sec/batch
Epoch: 5/5...  Training Step: 9125...  Training loss: 1.2460...  1.9974 sec/batch
Epoch: 5/5...  Training Step: 9126...  Training loss: 1.3116...  1.9369 sec/batch
Epoch: 5/5...  Training Step: 9127...  Training loss: 1.3040...  1.9319 sec/batch
Epoch: 5/5...  Training Step: 9128...  Training loss: 1.2971...  2.0004 sec/batch
Epoch: 5/5...  Training Step: 9129...  Training loss: 1.2239...  1.9098 sec/batch
Epoch: 5/5...  Training Step: 9130...  Training loss: 1.2943...  1.9614 sec/batch
Epoch: 5/5...  Training Step: 9131...  Training loss: 1.3237...  2.0279 sec/batch
Epoch: 5/5...  Training Step: 9132...  Training loss: 1.3345...  1.9939 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9221...  Training loss: 1.2475...  1.9263 sec/batch
Epoch: 5/5...  Training Step: 9222...  Training loss: 1.1908...  1.8773 sec/batch
Epoch: 5/5...  Training Step: 9223...  Training loss: 1.2144...  1.9184 sec/batch
Epoch: 5/5...  Training Step: 9224...  Training loss: 1.2062...  1.8858 sec/batch
Epoch: 5/5...  Training Step: 9225...  Training loss: 1.2027...  1.9504 sec/batch
Epoch: 5/5...  Training Step: 9226...  Training loss: 1.2219...  2.0039 sec/batch
Epoch: 5/5...  Training Step: 9227...  Training loss: 1.2953...  2.1352 sec/batch
Epoch: 5/5...  Training Step: 9228...  Training loss: 1.2579...  1.8724 sec/batch
Epoch: 5/5...  Training Step: 9229...  Training loss: 1.3673...  1.8390 sec/batch
Epoch: 5/5...  Training Step: 9230...  Training loss: 1.3236...  1.8271 sec/batch
Epoch: 5/5...  Training Step: 9231...  Training loss: 1.2915...  1.8759 sec/batch
Epoch: 5/5...  Training Step: 9232...  Training loss: 1.1990...  1.7673 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9321...  Training loss: 1.2707...  1.9028 sec/batch
Epoch: 5/5...  Training Step: 9322...  Training loss: 1.3151...  2.1671 sec/batch
Epoch: 5/5...  Training Step: 9323...  Training loss: 1.2987...  1.9924 sec/batch
Epoch: 5/5...  Training Step: 9324...  Training loss: 1.2325...  1.9559 sec/batch
Epoch: 5/5...  Training Step: 9325...  Training loss: 1.3593...  2.0550 sec/batch
Epoch: 5/5...  Training Step: 9326...  Training loss: 1.3526...  2.0495 sec/batch
Epoch: 5/5...  Training Step: 9327...  Training loss: 1.2612...  1.9784 sec/batch
Epoch: 5/5...  Training Step: 9328...  Training loss: 1.2696...  2.0234 sec/batch
Epoch: 5/5...  Training Step: 9329...  Training loss: 1.2370...  1.9544 sec/batch
Epoch: 5/5...  Training Step: 9330...  Training loss: 1.2467...  1.9569 sec/batch
Epoch: 5/5...  Training Step: 9331...  Training loss: 1.2703...  1.9999 sec/batch
Epoch: 5/5...  Training Step: 9332...  Training loss: 1.1833...  2.0215 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9421...  Training loss: 1.2969...  2.1041 sec/batch
Epoch: 5/5...  Training Step: 9422...  Training loss: 1.2126...  2.2783 sec/batch
Epoch: 5/5...  Training Step: 9423...  Training loss: 1.2276...  2.0360 sec/batch
Epoch: 5/5...  Training Step: 9424...  Training loss: 1.2301...  2.0114 sec/batch
Epoch: 5/5...  Training Step: 9425...  Training loss: 1.2360...  1.9484 sec/batch
Epoch: 5/5...  Training Step: 9426...  Training loss: 1.2787...  1.9984 sec/batch
Epoch: 5/5...  Training Step: 9427...  Training loss: 1.2771...  2.1471 sec/batch
Epoch: 5/5...  Training Step: 9428...  Training loss: 1.2338...  2.0130 sec/batch
Epoch: 5/5...  Training Step: 9429...  Training loss: 1.3055...  2.0345 sec/batch
Epoch: 5/5...  Training Step: 9430...  Training loss: 1.3203...  2.0195 sec/batch
Epoch: 5/5...  Training Step: 9431...  Training loss: 1.3838...  2.0229 sec/batch
Epoch: 5/5...  Training Step: 9432...  Training loss: 1.1630...  2.0565 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9521...  Training loss: 1.3618...  1.9159 sec/batch
Epoch: 5/5...  Training Step: 9522...  Training loss: 1.2770...  1.9128 sec/batch
Epoch: 5/5...  Training Step: 9523...  Training loss: 1.1877...  1.9884 sec/batch
Epoch: 5/5...  Training Step: 9524...  Training loss: 1.1398...  2.0510 sec/batch
Epoch: 5/5...  Training Step: 9525...  Training loss: 1.2777...  2.0044 sec/batch
Epoch: 5/5...  Training Step: 9526...  Training loss: 1.2541...  1.9033 sec/batch
Epoch: 5/5...  Training Step: 9527...  Training loss: 1.2319...  1.9674 sec/batch
Epoch: 5/5...  Training Step: 9528...  Training loss: 1.2071...  1.8978 sec/batch
Epoch: 5/5...  Training Step: 9529...  Training loss: 1.2497...  1.9449 sec/batch
Epoch: 5/5...  Training Step: 9530...  Training loss: 1.2637...  1.9904 sec/batch
Epoch: 5/5...  Training Step: 9531...  Training loss: 1.2033...  1.9664 sec/batch
Epoch: 5/5...  Training Step: 9532...  Training loss: 1.2980...  1.9489 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9621...  Training loss: 1.2180...  1.9789 sec/batch
Epoch: 5/5...  Training Step: 9622...  Training loss: 1.2232...  1.9579 sec/batch
Epoch: 5/5...  Training Step: 9623...  Training loss: 1.3052...  2.0259 sec/batch
Epoch: 5/5...  Training Step: 9624...  Training loss: 1.3499...  1.9944 sec/batch
Epoch: 5/5...  Training Step: 9625...  Training loss: 1.2818...  2.0269 sec/batch
Epoch: 5/5...  Training Step: 9626...  Training loss: 1.3750...  1.9539 sec/batch
Epoch: 5/5...  Training Step: 9627...  Training loss: 1.3305...  1.9038 sec/batch
Epoch: 5/5...  Training Step: 9628...  Training loss: 1.3007...  1.9934 sec/batch
Epoch: 5/5...  Training Step: 9629...  Training loss: 1.2920...  2.0119 sec/batch
Epoch: 5/5...  Training Step: 9630...  Training loss: 1.1900...  1.9459 sec/batch
Epoch: 5/5...  Training Step: 9631...  Training loss: 1.1874...  1.9949 sec/batch
Epoch: 5/5...  Training Step: 9632...  Training loss: 1.1973...  1.9999 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9721...  Training loss: 1.2471...  1.9576 sec/batch
Epoch: 5/5...  Training Step: 9722...  Training loss: 1.2475...  1.9778 sec/batch
Epoch: 5/5...  Training Step: 9723...  Training loss: 1.2494...  2.1456 sec/batch
Epoch: 5/5...  Training Step: 9724...  Training loss: 1.1922...  2.2462 sec/batch
Epoch: 5/5...  Training Step: 9725...  Training loss: 1.2005...  2.1826 sec/batch
Epoch: 5/5...  Training Step: 9726...  Training loss: 1.3809...  2.1461 sec/batch
Epoch: 5/5...  Training Step: 9727...  Training loss: 1.2909...  2.0375 sec/batch
Epoch: 5/5...  Training Step: 9728...  Training loss: 1.1783...  2.0275 sec/batch
Epoch: 5/5...  Training Step: 9729...  Training loss: 1.2470...  2.0400 sec/batch
Epoch: 5/5...  Training Step: 9730...  Training loss: 1.1890...  2.1401 sec/batch
Epoch: 5/5...  Training Step: 9731...  Training loss: 1.2216...  2.1045 sec/batch
Epoch: 5/5...  Training Step: 9732...  Training loss: 1.3158...  2.1671 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9821...  Training loss: 1.2640...  1.9584 sec/batch
Epoch: 5/5...  Training Step: 9822...  Training loss: 1.3237...  1.9614 sec/batch
Epoch: 5/5...  Training Step: 9823...  Training loss: 1.2680...  1.9654 sec/batch
Epoch: 5/5...  Training Step: 9824...  Training loss: 1.2793...  2.0645 sec/batch
Epoch: 5/5...  Training Step: 9825...  Training loss: 1.2876...  1.9309 sec/batch
Epoch: 5/5...  Training Step: 9826...  Training loss: 1.2644...  1.9769 sec/batch
Epoch: 5/5...  Training Step: 9827...  Training loss: 1.3857...  2.0405 sec/batch
Epoch: 5/5...  Training Step: 9828...  Training loss: 1.2404...  1.8883 sec/batch
Epoch: 5/5...  Training Step: 9829...  Training loss: 1.3384...  1.8928 sec/batch
Epoch: 5/5...  Training Step: 9830...  Training loss: 1.3901...  1.9549 sec/batch
Epoch: 5/5...  Training Step: 9831...  Training loss: 1.1717...  1.9169 sec/batch
Epoch: 5/5...  Training Step: 9832...  Training loss: 1.2779...  1.8918 sec/batch
Epoch: 5/5...  T

Epoch: 5/5...  Training Step: 9921...  Training loss: 1.3358...  1.8201 sec/batch
Epoch: 5/5...  Training Step: 9922...  Training loss: 1.3362...  1.9240 sec/batch
Epoch: 5/5...  Training Step: 9923...  Training loss: 1.2579...  1.8497 sec/batch
Epoch: 5/5...  Training Step: 9924...  Training loss: 1.3398...  1.8562 sec/batch
Epoch: 5/5...  Training Step: 9925...  Training loss: 1.3304...  1.8209 sec/batch


#### Saved checkpoints

Read up on saving and loading checkpoints here: https://www.tensorflow.org/programmers_guide/variables

In [19]:
tf.train.get_checkpoint_state('checkpoints')

model_checkpoint_path: "checkpoints\\i9925_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1000_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2000_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i3000_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i3200_l512.ckpt"
all_mo

## Sampling

Now that the network is trained, we can use it to generate new text. The idea is that we pass in a character, and the network predicts the next character. We then feed that new character back in to predict the one after it, and keep doing this to generate as much new text as we want. I also included some functionality to prime the network with some text by passing in a string and building up a state from that.

The network gives us a predicted probability for each character in the vocabulary. To reduce noise and make things a little less random, I'm only going to choose the new character from the top N most likely characters.



In [20]:
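def pick_top_n(preds, vocab_size, top_n=5):
    # Work on a flattened copy so the caller's array is not modified in place
    p = np.squeeze(preds).copy()
    # Zero out everything except the top_n most likely characters
    p[np.argsort(p)[:-top_n]] = 0
    # Renormalize so the remaining probabilities sum to 1
    p = p / np.sum(p)
    # Sample one character index from the truncated distribution
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c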
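To see what the top-N filter does to a distribution, here is a small sketch on made-up numbers. `top_n_filter` is a hypothetical helper that mirrors the zeroing and renormalizing steps of `pick_top_n` but returns the filtered distribution instead of a sample; the probabilities are invented for illustration and do not come from the trained network.

```python
import numpy as np

def top_n_filter(probs, top_n=5):
    """Zero all but the top_n largest entries and renormalize (returns a copy)."""
    p = np.array(probs, dtype=np.float64).ravel()
    # argsort is ascending, so everything before the last top_n indices is dropped
    p[np.argsort(p)[:-top_n]] = 0
    return p / p.sum()

# Invented distribution over a 6-character vocabulary
probs = np.array([0.05, 0.40, 0.10, 0.25, 0.15, 0.05])
filtered = top_n_filter(probs, top_n=3)
print(filtered)

# Sampling from the filtered distribution can only return the kept indices
rng = np.random.default_rng(0)
draws = rng.choice(len(filtered), size=1000, p=filtered)
print(sorted(set(draws.tolist())))
```

Only indices 1, 3, and 4 keep any probability mass, so the unlikely characters can never be drawn. A smaller `top_n` makes the output more conservative, a larger one more varied.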

In [21]:
def sample(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    # Start the output with the priming text (assumed to be non-empty)
    samples = [c for c in prime]
    # Build the network with sampling=True and restore the trained weights
    model = CharRNN(vocab_size, lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)
        # Feed the priming text one character at a time to build up the state
        for c in prime:
            x = np.zeros((1, 1))
            x[0,0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state],
                                         feed_dict=feed)

        # The predictions after the last primed character give the first new one
        c = pick_top_n(preds, vocab_size)
        samples.append(int_to_vocab[c])

        # Feed each sampled character back in to predict the next one
        for i in range(n_samples):
            x[0,0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state],
                                         feed_dict=feed)

            c = pick_top_n(preds, vocab_size)
            samples.append(int_to_vocab[c])

    return ''.join(samples)

Here, pass in the path to a checkpoint and sample from the network.

In [22]:
tf.train.latest_checkpoint('checkpoints')

'checkpoints\\i9925_l512.ckpt'
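The checkpoint paths printed in this notebook appear to follow the pattern `i<training step>_l<LSTM size>.ckpt`. Assuming that pattern holds, a small helper can pull the step and layer size back out of a path, which is handy when comparing samples from several checkpoints in training order. `CKPT_PATTERN` and `parse_checkpoint_name` are hypothetical names introduced for this sketch, not part of TensorFlow.

```python
import re

# Assumed naming pattern, inferred from the paths printed in this notebook
CKPT_PATTERN = re.compile(r'i(\d+)_l(\d+)\.ckpt$')

def parse_checkpoint_name(path):
    """Return (training_step, lstm_size) from a checkpoint path, or None if it does not match."""
    match = CKPT_PATTERN.search(path)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

paths = ['checkpoints\\i9925_l512.ckpt',
         'checkpoints/i200_l512.ckpt',
         'checkpoints/i600_l512.ckpt']

# Sort by training step so earlier checkpoints come first
by_step = sorted(paths, key=lambda p: parse_checkpoint_name(p)[0])
print(by_step)
```

Because the pattern is anchored at the end of the string, it works with both the backslash and forward-slash separators that show up in the outputs here.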

In [23]:
checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime="Far")
print(samp)

INFO:tensorflow:Restoring parameters from checkpoints\i9925_l512.ckpt
Farance Levin should not say she and her husband, had such a movement, whom he had not said a striggle of
happy of the carriage the pretty well-buttings was saying as the persons was all the boy of sories of a minute one straight. He felt she would have been directly, when he could the way in walling on him. He was saying.

"I am nor answer. Will
you think to
the mushroom out of that more any one," he said to her and his silence.

"What do you see it to
ask your principle, a cross were already about me all."

"Not a cruak. If the same time with a man interesting
when the mire, and then this was a secrice, we have tried. He's a
little thoughts and so almost sorts
of the balls," answered Levin, like all something or and attractive, but a shirt were she would be to then. "There is stull in the boutting of
the bottle what they were
that I don't want for yourself at the place of the point. I don't suppose I don't say anyt

In [24]:
checkpoint = 'checkpoints/i200_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

INFO:tensorflow:Restoring parameters from checkpoints/i200_l512.ckpt
Farsde,.

"" hhee on he tat ot are he ood an aot or too hed te or at al aled win ant an an aldd
ate on hees tha tin ase inn ot ti her an he oo tee oo an ione

onn ot e anl har hee in aned tarens th oone otin the
 ter ta io in her toes
tee whe tin int in ha tisin oo ten oong hee ote an thir ion oe hind an teis he aon and the tes ao oid on ite tine ald aind at al and an inned hi on an aode win ao thh tee tin thee wenthin hh en he and ote thee oed
antin hane, ao he the is the sine and ant ho at ha itingg ale ter anedd tit the an aon he etes
inne whe tees tae tone
etinn in ote on hit en to at oe aot an altee his ated, had oid hin tars aonther, ate ante ointh eed theesse aod he intere ase tee int
han tio the ate tens
oint ot toe at on iod he on ans ood wo te ot oe ine at andd ann had hed
ee ito in hhed as heris annd oon at heed tir ine oit itit ithhe
sennt an tat toe ao ane hees oter he thar oon oon, he in tie his on to an

In [25]:
checkpoint = 'checkpoints/i600_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

INFO:tensorflow:Restoring parameters from checkpoints/i600_l512.ckpt
Farny and stace on a that was
ig to sit to shat than's to he tore him, the singere thas the camles to
ta deerting of the reoving of the sispiles the loot the thang what shithe sor he dove to the
erand, and to here and alding washer ta to the he stong the teen then she tan wo the her and, what shingted thime was a loult to had his as that it that he tanken as he didene and a lougting the cound to the her was the courd on he seided
andaying to tan it thing of in seat a that,
bound on there worly, and and asded of and te har and his stind of hid bont the
erenter that he chanere a the shid
that
tere than hors to the simleress of thon here whin the canle, and sald and whilk nowe whach, the so she sald the sale who dores and to chentelt this andong wome thay. 
his, and that the cardaded the pistor, and that
thay the candating of hears hore thought of th to shomed he and a lat oulde so canded that his at ale to ta dertell on

In [26]:
checkpoint = 'checkpoints/i1200_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

INFO:tensorflow:Restoring parameters from checkpoints/i1200_l512.ckpt
Farring at a mant, and to a loncter the were of all and he was a mist and theme was neriged, to him alrow and a with wifher there hould to the carrtally.

"She was to them the was a companters and the pestes. That I was nere to so mone of her the some on his was to a mant," sour in a the working with a lowers. "The coult, were were, what she sumple of this his whores, as she was at in thes. I he said, with, she sand all the crouncest was taken, when the where was not,, and had, and was seritult to her have him to be winl hinger.. 
"I a ming thought her his cale, and, and the marr one whore the way to had not the parss," said her the saming on the propritess of her her that was the some
and wifing his aterst with whichers, he
seeded in the parres that here all a the was one as to her seed him the choll of she with the poncess it
her, who have sto got her to there seed to to his the prostly, who
was storded the seare t