#  Udacity Machine Learning Nanodegree Capstone Project

##  Recurrent Neural Network based language model.

## Naren Doraiswamy

## August 2017

Traditional neural networks have produced great results, helped by the computational power available today and the large amounts of data that can be processed with it. These networks take in their inputs without assuming any dependencies between them, which can be a problem in tasks such as natural language processing and speech/audio processing, and also in vision problems where adjacent pixels are almost the same.

In this capstone project, I will work on a language processing problem: I will train a network on a particular kind of text and then try to generate similar text from the model.


The papers that I have referred to are given below:

[Language Model based on Recurrent Neural Network](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)

[Extensions of recurrent neural network language model](http://www.fit.vutbr.cz/research/groups/speech/publi/2011/mikolov_icassp2011_5528.pdf)

[Generating Text with Recurrent Neural Networks](http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Sutskever_524.pdf)

### Language Modeling

Our goal is to build a language model using a recurrent neural network. Let's say we have a sentence of $n$ words. A language model allows us to predict the probability of observing the sentence (in a given dataset) as:
$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$
In words, the probability of a sentence is the product of probabilities of each word given the words that came before it. So, the probability of the sentence "He went to buy some chocolate" would be the probability of "chocolate" given "He went to buy some", multiplied by the probability of "some" given "He went to buy", and so on.
Why is that useful? Why would we want to assign a probability to observing a sentence?
First, such a model can be used as a scoring mechanism. For example, a Machine Translation system typically generates multiple candidates for an input sentence. You could use a language model to pick the most probable sentence. Intuitively, the most probable sentence is likely to be grammatically correct. Similar scoring happens in speech recognition systems.
But solving the language modeling problem also has a cool side effect. Because we can predict the probability of a word given the preceding words, we are able to generate new text. It's a generative model. Given an existing sequence of words, we sample the next word from the predicted probabilities and repeat the process until we have a full sentence. __And this is exactly what we are going to do: generate new text.__
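
To make the two ideas above concrete, here is a minimal sketch in plain Python. All probabilities are made up for illustration; a trained model would supply them. It scores the example sentence with the chain rule (summing logs rather than multiplying, which is the usual way to avoid underflow) and then samples one next word from a toy distribution.

```python
import math
import random

# Made-up values of P(w_i | w_1, ..., w_{i-1}) for the six words of
# "He went to buy some chocolate"; a trained model would supply these.
conditional_probs = [0.05, 0.10, 0.40, 0.02, 0.30, 0.01]

def sentence_log_prob(probs):
    """Chain rule in log space: sum the logs of the conditional probabilities."""
    return sum(math.log(p) for p in probs)

sentence_prob = math.exp(sentence_log_prob(conditional_probs))

# Generation: sample the next word from a (made-up) predicted distribution.
next_word_probs = {"chocolate": 0.6, "milk": 0.3, "bread": 0.1}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
next_word = rng.choices(list(next_word_probs),
                        weights=list(next_word_probs.values()), k=1)[0]

print(sentence_prob, next_word)
```

Repeating the sampling step, each time appending the sampled word to the context, is what produces a full generated sentence.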

We will be using a special kind of RNN called [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory), since vanilla RNNs cannot hold on to memory for long and suffer from the vanishing and exploding gradient problems. An LSTM is a modification of the vanilla RNN cell, and there are many other RNN variants, another well-known one being the gated recurrent unit ([GRU](https://perso.ens-lyon.fr/tristan.sterin/papers/An_Intrinsic_Difference_Between_Vanilla_GRU_Sterin_Farrugia_Gripon.pdf)).

So let's get started with the implementation.

In this section I will briefly list the steps that we will perform to get the desired results from the model.

1. Import all the necessary libraries and perform pre-processing.
2. Create mini-batches for faster training purposes.
3. Feed these batches to the ML model and train it using this data.
4. Compute the training loss and optimize it by tuning the hyper-parameters which will best suit the network.
5. Perform Sampling to generate the new text.

Now let's go in detail with every single step and generate new text. __Sounds exciting__. 

###  Pre-processing

Our input is text, but the model understands only numbers, so we cannot feed the text to it directly. We'll load the text file and convert it into integers for our network to use. We will create a couple of dictionaries to convert the characters to and from integers. Encoding the characters as integers makes them easier to use as input to the network.

In [1]:
import time
from collections import namedtuple
import pandas as pd
import numpy as np
import tensorflow as tf

import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')


In [2]:
with open('reddit.txt', 'r') as f:
    text=f.read()
vocab = sorted(set(text))
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)

In [3]:
text[:100]

'body\n"I joined a new league this year and they have different scoring rules than I\'m used to. It\'s a'

In [4]:
encoded[:100]

array([66, 79, 68, 89,  1,  4, 41,  2, 74, 79, 73, 78, 69, 68,  2, 65,  2,
       78, 69, 87,  2, 76, 69, 65, 71, 85, 69,  2, 84, 72, 73, 83,  2, 89,
       69, 65, 82,  2, 65, 78, 68,  2, 84, 72, 69, 89,  2, 72, 65, 86, 69,
        2, 68, 73, 70, 70, 69, 82, 69, 78, 84,  2, 83, 67, 79, 82, 73, 78,
       71,  2, 82, 85, 76, 69, 83,  2, 84, 72, 65, 78,  2, 41,  9, 77,  2,
       85, 83, 69, 68,  2, 84, 79, 16,  2, 41, 84,  9, 83,  2, 65], dtype=int32)
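
The two dictionaries are inverses of each other, so the integers can always be turned back into text. Here is a minimal sketch of that round trip on a toy string; the `toy_` names are only for this illustration and are not part of the notebook.

```python
import numpy as np

toy_text = "hello world"

# Same construction as above, applied to a tiny string
toy_vocab = sorted(set(toy_text))
toy_vocab_to_int = {c: i for i, c in enumerate(toy_vocab)}
toy_int_to_vocab = dict(enumerate(toy_vocab))

# Encode characters as integers, then decode them back
toy_encoded = np.array([toy_vocab_to_int[c] for c in toy_text], dtype=np.int32)
toy_decoded = "".join(toy_int_to_vocab[int(i)] for i in toy_encoded)

print(toy_encoded)
print(toy_decoded)
```

The same decoding step is what turns sampled integers back into readable text at generation time.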

###  Creating Mini-batches

Now that we have our text encoded into one long sequence of integers, let's create a function that gives us an iterator over our batches. I like using [generator functions](https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/) to do this. Then we can pass encoded into this function and get our batch generator.

The reason we batch is to use matrix operations efficiently during training: the RNN trains on multiple sequences in parallel, which reduces our training time.

What we need to do is discard some of the text so we only have completely full batches. Each batch contains $N \times M$ characters, where $N$ is the batch size (the number of sequences) and $M$ is the number of steps. Then, to get the number of batches we can make from some array arr, you divide the length of arr by the number of characters per batch, $N \times M$, rounding down. Once you know the number of batches and the number of characters per batch, you can get the total number of characters to keep.
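
As a worked example of this arithmetic, here is a small sketch with toy numbers. The array length of 107 is made up and is not the size of the real dataset.

```python
arr_len = 107                # hypothetical length of the encoded array
n_seqs, n_steps = 2, 5       # N sequences per batch, M steps per sequence

chars_per_batch = n_seqs * n_steps        # N x M characters in each batch
n_batches = arr_len // chars_per_batch    # full batches only, rounded down
n_keep = n_batches * chars_per_batch      # characters kept
n_discarded = arr_len - n_keep            # leftover characters dropped

print(chars_per_batch, n_batches, n_keep, n_discarded)
```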

In [5]:
def get_batches(arr, n_seqs, n_steps):
    '''Create a generator that returns batches of size
       n_seqs x n_steps from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       n_seqs: Batch size, the number of sequences per batch
       n_steps: Number of sequence steps per batch
    '''
    # Get the number of characters per batch and number of batches we can make
    chars_per_batch = n_seqs * n_steps
    n_batches = len(arr)//chars_per_batch
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches*chars_per_batch]
    
    # Reshape into n_seqs rows
    arr = arr.reshape((n_seqs,-1))
    
    for n in range(0, arr.shape[1], n_steps):
        # The features
        x = arr[:,n:n+n_steps]
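        # The targets, shifted by one; the last target of each row wraps around
        # to that row's first input in this batch (a simplification)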
        y = np.zeros_like(x)
        y[:,:-1], y[:,-1] = x[:,1:], x[:, 0] 
        yield x, y

So let's use a batch size of 500 sequences with 50 sequence steps and call our generator.

In [6]:
batches = get_batches(encoded, 500,50)
x, y = next(batches)

In [7]:
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[66 79 68 89  1  4 41  2 74 79]
 [69 67 72  2 67 65 82 68 83  2]
 [79  2 65 78 68  2 65 82 69 14]
 [84 72 69  2 79 84 72 69 82 83]
 [ 2 73 78 86 79 76 86 69 68  2]
 [73 82  2 79 87 78  2 82 69 80]
 [86 69  2 35 85 76 84 85 82 69]
 [82 83  2 82 69 66 85 73 76 68]
 [68  2 84 72 69  2 73 78 73 84]
 [82 65 84 83  2 73 83  2 19 20]]

y
 [[79 68 89  1  4 41  2 74 79 73]
 [67 72  2 67 65 82 68 83  2 74]
 [ 2 65 78 68  2 65 82 69 14  2]
 [72 69  2 79 84 72 69 82 83  2]
 [73 78 86 79 76 86 69 68  2 66]
 [82  2 79 87 78  2 82 69 80 82]
 [69  2 35 85 76 84 85 82 69 14]
 [83  2 82 69 66 85 73 76 68  2]
 [ 2 84 72 69  2 73 78 73 84 73]
 [65 84 83  2 73 83  2 19 20 25]]
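
The shift between x and y is easier to see on a small array. The sketch below repeats a condensed copy of the generator so that it runs on its own, and feeds it np.arange(20) as a stand-in for the encoded text.

```python
import numpy as np

def get_batches(arr, n_seqs, n_steps):
    """Condensed copy of the generator defined above."""
    chars_per_batch = n_seqs * n_steps
    n_batches = len(arr) // chars_per_batch
    arr = arr[:n_batches * chars_per_batch].reshape((n_seqs, -1))
    for n in range(0, arr.shape[1], n_steps):
        x = arr[:, n:n + n_steps]
        y = np.zeros_like(x)
        # Targets are the inputs shifted left by one; the last column wraps
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y

toy = np.arange(20)  # stand-in for the encoded text
toy_batches = list(get_batches(toy, n_seqs=2, n_steps=5))
first_x, first_y = toy_batches[0]

print(first_x)
print(first_y)
```

Each row of x continues in the same row of the next batch, which is why the final LSTM state of one batch can be passed on as the initial state of the next during training.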


### Building the Model

This part is divided into 3 steps:

1. Create the input placeholders.
2. Build the LSTM cells.
3. Define the RNN output.

#### Creating the input placeholders

First off we'll create our input placeholders. As usual we need placeholders for the training data and the targets. We'll also create a placeholder for dropout layers called keep_prob. This will be a scalar, that is a 0-D tensor. To make a scalar, you create a placeholder without giving it a size.

In [8]:
def build_inputs(batch_size, num_steps):
    ''' Define placeholders for inputs, targets, and dropout 
    
        Arguments
        ---------
        batch_size: Batch size, number of sequences per batch
        num_steps: Number of sequence steps in a batch
        
    '''
    # Declare placeholders we'll feed into the graph
    inputs = tf.placeholder(tf.int32, [batch_size, num_steps], name='inputs')
    targets = tf.placeholder(tf.int32, [batch_size, num_steps], name='targets')
    
    # Keep probability placeholder for drop out layers
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
    return inputs, targets, keep_prob

#### Building the LSTM cells.

Here we will create the LSTM cell we'll use in the hidden layer. We'll use this cell as a building block for the RNN. So we aren't actually defining the RNN here, just the type of cell we'll use in the hidden layer.

We first create a basic LSTM cell with

             lstm = tf.contrib.rnn.BasicLSTMCell(num_units)

where num_units is the number of units in the hidden layers in the cell. Then we can add dropout by wrapping it with

             tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

You pass in a cell and it will automatically add dropout to the inputs or outputs. Finally, we can stack up the LSTM cells into layers with tf.contrib.rnn.MultiRNNCell. With this, you pass in a list of cells and it will send the output of one cell into the next cell. Previously, with TensorFlow 1.0, you could do this

              tf.contrib.rnn.MultiRNNCell([cell]*num_layers)

Starting with TensorFlow 1.1, every layer needs its own cell object, so below we build a separate cell for each layer.

In [9]:
def build_lstm(lstm_size, num_layers, batch_size, keep_prob):
    ''' Build LSTM cell.
    
        Arguments
        ---------
        keep_prob: Scalar tensor (tf.placeholder) for the dropout keep probability
        lstm_size: Size of the hidden layers in the LSTM cells
        num_layers: Number of LSTM layers
        batch_size: Batch size

    '''
    ### Build the LSTM Cell
    def build_cell():
        # Use a basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        # Add dropout to the cell outputs
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    
    # Stack up multiple LSTM layers, for deep learning; each layer gets its own cell
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(num_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    return cell, initial_state
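
To show what the keep probability does to the cell outputs, here is a NumPy sketch of inverted dropout. It is an illustration of the idea, not TensorFlow's implementation: each unit is kept with probability keep_prob, and the kept units are scaled by 1/keep_prob so the expected value stays the same. The function name and the toy activations are assumptions made for this example.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the rest."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((4, 8))                             # hypothetical LSTM outputs

h_train = dropout(h, keep_prob=0.5, rng=rng)    # roughly half the units zeroed
h_eval = dropout(h, keep_prob=1.0, rng=rng)     # keep_prob = 1.0 disables dropout

print(h_train)
```

This is why keep_prob is a placeholder: it can be set below 1 while training and fed as 1.0 when sampling.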

#### RNN Output

Here we'll create the output layer. We need to connect the output of the RNN cells to a fully connected layer with a softmax output. The softmax output gives us a probability distribution we can use to predict the next character, so we want this layer to have size $C$, the number of classes/characters we have in our text.

In [10]:
def build_output(lstm_output, in_size, out_size):
    ''' Build a softmax layer, return the softmax output and logits.
    
        Arguments
        ---------
        
        lstm_output: List of output tensors from the LSTM layer
        in_size: Size of the input tensor, for example, size of the LSTM cells
        out_size: Size of this softmax layer
    
    '''

    # Reshape output so it's a bunch of rows, one row for each step for each sequence.
    # Concatenate lstm_output over axis 1 (the columns)
    
    seq_output = tf.concat(lstm_output, axis=1)
    
    # Reshape seq_output to a 2D tensor with in_size columns
    # (use the in_size argument instead of relying on a global lstm_size)
    x = tf.reshape(seq_output,[-1, in_size])
    
    # Connect the RNN outputs to a softmax layer
    with tf.variable_scope('softmax'):
        # Create the weight and bias variables here
        softmax_w = tf.Variable(tf.truncated_normal((in_size,out_size),stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))
    
    # Since output is a bunch of rows of RNN cell outputs, logits will be a bunch
    # of rows of logit outputs, one for each step and sequence
    logits = tf.matmul(x, softmax_w) + softmax_b
    
    # Use softmax to get the probabilities for predicted characters
    out = tf.nn.softmax(logits, name='predictions')
    
    return out, logits
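
For intuition about the softmax step, here is a small NumPy sketch that turns rows of logits into probability distributions. The logit values are made up, and subtracting the row maximum is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over a 2D array of logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two rows of made-up logits over C = 3 classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)

print(probs)
print(probs.sum(axis=1))
```

Each row sums to 1, and equal logits give a uniform distribution over the classes.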

#### Training loss

We take the logits and targets and calculate the softmax cross-entropy loss. First we need to one-hot encode the targets, since we're getting them as integer-encoded characters. Then we reshape the one-hot targets into a 2D tensor with size $(M*N) \times C$, where $C$ is the number of classes/characters we have. Remember that we reshaped the LSTM outputs and ran them through a fully connected layer with $C$ units. So our logits will also have size $(M*N) \times C$.
Then we run the logits and targets through tf.nn.softmax_cross_entropy_with_logits and find the mean to get the loss.

In [11]:
def build_loss(logits, targets, lstm_size, num_classes):
    ''' Calculate the loss from the logits and the targets.
    
        Arguments
        ---------
        logits: Logits from final fully connected layer
        targets: Targets for supervised learning
        lstm_size: Number of LSTM hidden units
        num_classes: Number of classes in targets
        
    '''
    
    # One-hot encode targets and reshape to match logits, one row per sequence per step
    y_one_hot = tf.one_hot(targets, num_classes)
    y_reshaped =  tf.reshape(y_one_hot, logits.get_shape())
    
    # Softmax cross entropy loss
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped)
    loss = tf.reduce_mean(loss)
    return loss
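
The same computation can be written out in NumPy to see what the loss measures. This is a sketch with made-up inputs, not the TensorFlow op itself: it one-hot encodes the targets, takes the log-softmax of the logits, and averages the negative log-probability of the correct class over all rows.

```python
import numpy as np

def softmax_cross_entropy(logits, targets, num_classes):
    """Mean cross-entropy between softmax(logits) and integer targets."""
    one_hot = np.eye(num_classes)[targets]                  # (rows, C)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot * log_probs).sum(axis=1).mean()

# Six rows (for example M*N = 6) over C = 4 classes; all-zero logits
# give a uniform prediction for every row
logits = np.zeros((6, 4))
targets = np.array([0, 1, 2, 3, 0, 1])
loss = softmax_cross_entropy(logits, targets, num_classes=4)

print(loss, np.log(4))
```

A model that predicts a uniform distribution over $C$ classes has a loss of $\ln C$, which is a handy reference point when reading training losses.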

#### Optimizer

Normal RNNs have issues with exploding and vanishing gradients. LSTMs fix the vanishing problem, but the gradients can still grow without bound. To fix this, we can clip the gradients above some threshold. Here we use tf.clip_by_global_norm: if the combined (global) norm of all the gradients is larger than the threshold, every gradient is scaled down by the same factor so that the global norm equals the threshold. This ensures the gradients never grow overly large. Then we use an AdamOptimizer for the learning step.

In [12]:
def build_optimizer(loss, learning_rate, grad_clip):
    ''' Build optimizer for training, using gradient clipping.
    
        Arguments:
        loss: Network loss
        learning_rate: Learning rate for optimizer
        grad_clip: Threshold for the global norm of the gradients
    
    '''
    
    # Optimizer for training, using gradient clipping to control exploding gradients
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    train_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(grads, tvars))
    
    return optimizer
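
Clipping by global norm can be sketched in a few lines of NumPy. The function below mirrors the documented behaviour of tf.clip_by_global_norm as I understand it (scale every gradient by clip_norm / max(global_norm, clip_norm)); the gradient values are made up.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their global norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Two made-up gradient tensors with global norm sqrt(3^2 + 4^2) = 5
grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)

print(norm)
print(clipped)
```

Because every gradient is scaled by the same factor, the direction of the overall update is preserved; only its length is limited.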

#### Build the network

Here we will use tf.nn.dynamic_rnn. This function will pass the hidden and cell states across LSTM cells appropriately for us. It returns the outputs for each LSTM cell at each step for each sequence in the mini-batch. It also gives us the final LSTM state. We want to save this state as final_state so we can pass it to the first LSTM cell in the next mini-batch run. For tf.nn.dynamic_rnn, we pass in the cell and initial state we get from build_lstm, as well as our input sequences. Also, we need to one-hot encode the inputs before they go into the RNN.

In [13]:
class CharRNN:
    
    def __init__(self, num_classes, batch_size=64, num_steps=50, 
                       lstm_size=128, num_layers=2, learning_rate=0.001, 
                       grad_clip=5, sampling=False):
    
        # When we're using this network for sampling later, we'll be passing in
        # one character at a time, so providing an option for that
        if sampling:
            batch_size, num_steps = 1, 1

        tf.reset_default_graph()
        
        # Build the input placeholder tensors
        self.inputs, self.targets, self.keep_prob = build_inputs(batch_size, num_steps)

        # Build the LSTM cell
        cell, self.initial_state = build_lstm(lstm_size, num_layers, batch_size, self.keep_prob)

        ### Run the data through the RNN layers
        # First, one-hot encode the input tokens
        x_one_hot = tf.one_hot(self.inputs, num_classes)
        
        # Run each sequence step through the RNN with tf.nn.dynamic_rnn 
        outputs, state = tf.nn.dynamic_rnn(cell, x_one_hot, initial_state = self.initial_state)
        self.final_state = state
        
        # Get softmax predictions and logits
        self.prediction, self.logits =  build_output(outputs, lstm_size, num_classes)
        
        # Loss and optimizer (with gradient clipping)
        self.loss =  build_loss(self.logits, self.targets, lstm_size, num_classes)
        self.optimizer = build_optimizer(self.loss, learning_rate, grad_clip)

#### Hyperparameters

1. batch_size - Number of sequences running through the network in one pass.
2. num_steps - Number of characters in the sequence the network is trained on. Larger is typically better, since the network will learn more long-range dependencies, but it takes longer to train. 100 is typically a good number here.
3. lstm_size - The number of units in the hidden layers.
4. num_layers - Number of hidden LSTM layers to use.
5. learning_rate - Learning rate for training.
6. keep_prob - The dropout keep probability when training. If your network is overfitting, try decreasing this.

### Best models strategy

The winning strategy for obtaining very good models (if you have the compute time) is to always err on the side of making the network larger (as large as you're willing to wait for it to compute) and then try different dropout values (between 0 and 1). Whichever model has the best validation performance (the validation loss; lower is better) is the one you should use in the end.

It is very common in deep learning to run many different models with many different hyperparameter settings, and in the end take whatever checkpoint gave the best validation performance.
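
Comparing models by validation performance needs some data that the network never trains on. The training code in this notebook does not set any aside, so the following is only a sketch of one simple way to do it: hold out the last fraction of the encoded array. The function name and the 10% fraction are assumptions, not part of the original code.

```python
import numpy as np

def train_val_split(arr, val_fraction=0.1):
    """Hold out the last val_fraction of a 1D array for validation."""
    n_val = int(len(arr) * val_fraction)
    return arr[:len(arr) - n_val], arr[len(arr) - n_val:]

toy_encoded = np.arange(100)   # stand-in for the real encoded text
train_data, val_data = train_val_split(toy_encoded, val_fraction=0.1)

print(len(train_data), len(val_data))
```

The training batches would then come from train_data only, and the loss on val_data would be the number to compare across hyperparameter settings.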


In [14]:
batch_size = 500         # Sequences per batch
num_steps = 50          # Number of sequence steps per batch
lstm_size = 128         # Size of hidden layers in LSTMs
num_layers = 2          # Number of LSTM layers
learning_rate = 0.01    # Learning rate
keep_prob = 0.5         # Dropout keep probability

###  Training

In [20]:
epochs = 15
# Save every N iterations
save_every_n = 200

model = CharRNN(len(vocab), batch_size=batch_size, num_steps=num_steps,
                lstm_size=lstm_size, num_layers=num_layers, 
                learning_rate=learning_rate)

saver = tf.train.Saver(max_to_keep=100)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    # Use the line below to load a checkpoint and resume training
    #saver.restore(sess, 'checkpoints/______.ckpt')
    counter = 0
    for e in range(epochs):
        # Train network
        new_state = sess.run(model.initial_state)
        loss = 0
        for x, y in get_batches(encoded, batch_size, num_steps):
            counter += 1
            start = time.time()
            feed = {model.inputs: x,
                    model.targets: y,
                    model.keep_prob: keep_prob,
                    model.initial_state: new_state}
            batch_loss, new_state, _ = sess.run([model.loss, 
                                                 model.final_state, 
                                                 model.optimizer], 
                                                 feed_dict=feed)
            
            end = time.time()
            print('Epoch: {}/{}... '.format(e+1, epochs),
                  'Training Step: {}... '.format(counter),
                  'Training loss: {:.4f}... '.format(batch_loss),
                  '{:.4f} sec/batch'.format((end-start)))
        
            if (counter % save_every_n == 0):
                saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))
    
    saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))

Epoch: 1/15...  Training Step: 1...  Training loss: 6.1200...  0.2299 sec/batch
Epoch: 1/15...  Training Step: 2...  Training loss: 6.0304...  0.1727 sec/batch
Epoch: 1/15...  Training Step: 3...  Training loss: 4.7055...  0.1593 sec/batch
Epoch: 1/15...  Training Step: 4...  Training loss: 3.9120...  0.1469 sec/batch
Epoch: 1/15...  Training Step: 5...  Training loss: 3.6852...  0.1392 sec/batch
Epoch: 1/15...  Training Step: 6...  Training loss: 3.5991...  0.1354 sec/batch
Epoch: 1/15...  Training Step: 7...  Training loss: 3.5895...  0.1331 sec/batch
Epoch: 1/15...  Training Step: 8...  Training loss: 3.5589...  0.1355 sec/batch
Epoch: 1/15...  Training Step: 9...  Training loss: 3.5036...  0.1329 sec/batch
Epoch: 1/15...  Training Step: 10...  Training loss: 3.4853...  0.1326 sec/batch
Epoch: 1/15...  Training Step: 11...  Training loss: 3.4519...  0.1311 sec/batch
Epoch: 1/15...  Training Step: 12...  Training loss: 3.4404...  0.1338 sec/batch
...

Epoch: 1/15...  Training Step: 103...  Training loss: 2.8949...  0.1304 sec/batch
Epoch: 1/15...  Training Step: 104...  Training loss: 2.8600...  0.1368 sec/batch
Epoch: 1/15...  Training Step: 105...  Training loss: 2.8875...  0.1351 sec/batch
Epoch: 1/15...  Training Step: 106...  Training loss: 2.8722...  0.1289 sec/batch
Epoch: 1/15...  Training Step: 107...  Training loss: 2.8572...  0.1317 sec/batch
Epoch: 1/15...  Training Step: 108...  Training loss: 2.8460...  0.1303 sec/batch
Epoch: 1/15...  Training Step: 109...  Training loss: 2.8669...  0.1328 sec/batch
Epoch: 1/15...  Training Step: 110...  Training loss: 2.8448...  0.1310 sec/batch
Epoch: 1/15...  Training Step: 111...  Training loss: 2.8700...  0.1338 sec/batch
Epoch: 1/15...  Training Step: 112...  Training loss: 2.8127...  0.1350 sec/batch
Epoch: 1/15...  Training Step: 113...  Training loss: 2.8311...  0.1355 sec/batch
Epoch: 1/15...  Training Step: 114...  Training loss: 2.8185...  0.1343 sec/batch
...

Epoch: 1/15...  Training Step: 203...  Training loss: 2.5973...  0.1405 sec/batch
Epoch: 1/15...  Training Step: 204...  Training loss: 2.5948...  0.1333 sec/batch
Epoch: 1/15...  Training Step: 205...  Training loss: 2.5859...  0.1336 sec/batch
Epoch: 1/15...  Training Step: 206...  Training loss: 2.5636...  0.1316 sec/batch
Epoch: 1/15...  Training Step: 207...  Training loss: 2.5652...  0.1336 sec/batch
Epoch: 1/15...  Training Step: 208...  Training loss: 2.5560...  0.1301 sec/batch
Epoch: 1/15...  Training Step: 209...  Training loss: 2.5394...  0.1374 sec/batch
Epoch: 1/15...  Training Step: 210...  Training loss: 2.5567...  0.1362 sec/batch
Epoch: 1/15...  Training Step: 211...  Training loss: 2.5413...  0.1338 sec/batch
Epoch: 1/15...  Training Step: 212...  Training loss: 2.5325...  0.1358 sec/batch
Epoch: 1/15...  Training Step: 213...  Training loss: 2.5506...  0.1341 sec/batch
Epoch: 1/15...  Training Step: 214...  Training loss: 2.5296...  0.1343 sec/batch
...

Epoch: 1/15...  Training Step: 303...  Training loss: 2.4117...  0.1322 sec/batch
Epoch: 2/15...  Training Step: 304...  Training loss: 2.4720...  0.1788 sec/batch
Epoch: 2/15...  Training Step: 305...  Training loss: 2.4112...  0.1845 sec/batch
Epoch: 2/15...  Training Step: 306...  Training loss: 2.3948...  0.1346 sec/batch
Epoch: 2/15...  Training Step: 307...  Training loss: 2.4153...  0.1340 sec/batch
Epoch: 2/15...  Training Step: 308...  Training loss: 2.4052...  0.1313 sec/batch
Epoch: 2/15...  Training Step: 309...  Training loss: 2.4302...  0.1349 sec/batch
Epoch: 2/15...  Training Step: 310...  Training loss: 2.4071...  0.1318 sec/batch
Epoch: 2/15...  Training Step: 311...  Training loss: 2.4105...  0.1343 sec/batch
Epoch: 2/15...  Training Step: 312...  Training loss: 2.3766...  0.1302 sec/batch
Epoch: 2/15...  Training Step: 313...  Training loss: 2.3929...  0.1329 sec/batch
Epoch: 2/15...  Training Step: 314...  Training loss: 2.3757...  0.1340 sec/batch
...

Epoch: 2/15...  Training Step: 403...  Training loss: 2.3110...  0.1351 sec/batch
Epoch: 2/15...  Training Step: 404...  Training loss: 2.3035...  0.1330 sec/batch
Epoch: 2/15...  Training Step: 405...  Training loss: 2.2991...  0.1351 sec/batch
Epoch: 2/15...  Training Step: 406...  Training loss: 2.2978...  0.1330 sec/batch
Epoch: 2/15...  Training Step: 407...  Training loss: 2.2908...  0.1345 sec/batch
Epoch: 2/15...  Training Step: 408...  Training loss: 2.3083...  0.1334 sec/batch
Epoch: 2/15...  Training Step: 409...  Training loss: 2.3099...  0.1308 sec/batch
Epoch: 2/15...  Training Step: 410...  Training loss: 2.3012...  0.1801 sec/batch
Epoch: 2/15...  Training Step: 411...  Training loss: 2.3036...  0.1843 sec/batch
Epoch: 2/15...  Training Step: 412...  Training loss: 2.3200...  0.1330 sec/batch
Epoch: 2/15...  Training Step: 413...  Training loss: 2.3108...  0.1340 sec/batch
Epoch: 2/15...  Training Step: 414...  Training loss: 2.3301...  0.1305 sec/batch
...

Epoch: 2/15...  Training Step: 503...  Training loss: 2.2573...  0.1298 sec/batch
Epoch: 2/15...  Training Step: 504...  Training loss: 2.2432...  0.1318 sec/batch
Epoch: 2/15...  Training Step: 505...  Training loss: 2.2435...  0.1329 sec/batch
Epoch: 2/15...  Training Step: 506...  Training loss: 2.2740...  0.1348 sec/batch
Epoch: 2/15...  Training Step: 507...  Training loss: 2.2814...  0.1335 sec/batch
Epoch: 2/15...  Training Step: 508...  Training loss: 2.2759...  0.1333 sec/batch
Epoch: 2/15...  Training Step: 509...  Training loss: 2.2574...  0.1363 sec/batch
Epoch: 2/15...  Training Step: 510...  Training loss: 2.2653...  0.1340 sec/batch
Epoch: 2/15...  Training Step: 511...  Training loss: 2.2601...  0.1328 sec/batch
Epoch: 2/15...  Training Step: 512...  Training loss: 2.2543...  0.1347 sec/batch
Epoch: 2/15...  Training Step: 513...  Training loss: 2.2538...  0.1345 sec/batch
Epoch: 2/15...  Training Step: 514...  Training loss: 2.2402...  0.1363 sec/batch
...

Epoch: 2/15...  Training Step: 603...  Training loss: 2.2046...  0.1316 sec/batch
Epoch: 2/15...  Training Step: 604...  Training loss: 2.1843...  0.1330 sec/batch
Epoch: 2/15...  Training Step: 605...  Training loss: 2.2015...  0.1303 sec/batch
Epoch: 2/15...  Training Step: 606...  Training loss: 2.1977...  0.1357 sec/batch
Epoch: 3/15...  Training Step: 607...  Training loss: 2.2580...  0.1310 sec/batch
Epoch: 3/15...  Training Step: 608...  Training loss: 2.2120...  0.1328 sec/batch
Epoch: 3/15...  Training Step: 609...  Training loss: 2.1770...  0.1331 sec/batch
Epoch: 3/15...  Training Step: 610...  Training loss: 2.1983...  0.1334 sec/batch
Epoch: 3/15...  Training Step: 611...  Training loss: 2.1949...  0.1333 sec/batch
Epoch: 3/15...  Training Step: 612...  Training loss: 2.2196...  0.1287 sec/batch
Epoch: 3/15...  Training Step: 613...  Training loss: 2.2023...  0.1335 sec/batch
Epoch: 3/15...  Training Step: 614...  Training loss: 2.2047...  0.1336 sec/batch
...

Epoch: 3/15...  Training Step: 703...  Training loss: 2.1571...  0.1285 sec/batch
Epoch: 3/15...  Training Step: 704...  Training loss: 2.1681...  0.1306 sec/batch
Epoch: 3/15...  Training Step: 705...  Training loss: 2.1375...  0.1312 sec/batch
Epoch: 3/15...  Training Step: 706...  Training loss: 2.1518...  0.1332 sec/batch
Epoch: 3/15...  Training Step: 707...  Training loss: 2.1527...  0.1334 sec/batch
Epoch: 3/15...  Training Step: 708...  Training loss: 2.1436...  0.1340 sec/batch
Epoch: 3/15...  Training Step: 709...  Training loss: 2.1499...  0.1322 sec/batch
Epoch: 3/15...  Training Step: 710...  Training loss: 2.1368...  0.1328 sec/batch
Epoch: 3/15...  Training Step: 711...  Training loss: 2.1480...  0.1313 sec/batch
Epoch: 3/15...  Training Step: 712...  Training loss: 2.1551...  0.1307 sec/batch
Epoch: 3/15...  Training Step: 713...  Training loss: 2.1504...  0.1339 sec/batch
Epoch: 3/15...  Training Step: 714...  Training loss: 2.1493...  0.1315 sec/batch
...

Epoch: 3/15...  Training Step: 803...  Training loss: 2.1392...  0.1345 sec/batch
Epoch: 3/15...  Training Step: 804...  Training loss: 2.1263...  0.1332 sec/batch
Epoch: 3/15...  Training Step: 805...  Training loss: 2.1289...  0.1357 sec/batch
Epoch: 3/15...  Training Step: 806...  Training loss: 2.1389...  0.1315 sec/batch
Epoch: 3/15...  Training Step: 807...  Training loss: 2.1306...  0.1331 sec/batch
Epoch: 3/15...  Training Step: 808...  Training loss: 2.1332...  0.1309 sec/batch
Epoch: 3/15...  Training Step: 809...  Training loss: 2.1547...  0.1299 sec/batch
Epoch: 3/15...  Training Step: 810...  Training loss: 2.1687...  0.1326 sec/batch
Epoch: 3/15...  Training Step: 811...  Training loss: 2.1582...  0.1345 sec/batch
Epoch: 3/15...  Training Step: 812...  Training loss: 2.1381...  0.1335 sec/batch
Epoch: 3/15...  Training Step: 813...  Training loss: 2.1551...  0.1327 sec/batch
Epoch: 3/15...  Training Step: 814...  Training loss: 2.1554...  0.1350 sec/batch
...

Epoch: 3/15...  Training Step: 903...  Training loss: 2.1066...  0.1355 sec/batch
Epoch: 3/15...  Training Step: 904...  Training loss: 2.0932...  0.1353 sec/batch
Epoch: 3/15...  Training Step: 905...  Training loss: 2.1146...  0.1323 sec/batch
Epoch: 3/15...  Training Step: 906...  Training loss: 2.1114...  0.1323 sec/batch
Epoch: 3/15...  Training Step: 907...  Training loss: 2.1096...  0.1353 sec/batch
Epoch: 3/15...  Training Step: 908...  Training loss: 2.1060...  0.1338 sec/batch
Epoch: 3/15...  Training Step: 909...  Training loss: 2.1002...  0.1314 sec/batch
Epoch: 4/15...  Training Step: 910...  Training loss: 2.1731...  0.1338 sec/batch
Epoch: 4/15...  Training Step: 911...  Training loss: 2.1257...  0.1312 sec/batch
Epoch: 4/15...  Training Step: 912...  Training loss: 2.0932...  0.1340 sec/batch
Epoch: 4/15...  Training Step: 913...  Training loss: 2.1089...  0.1298 sec/batch
Epoch: 4/15...  Training Step: 914...  Training loss: 2.1104...  0.1316 sec/batch
...

Epoch: 4/15...  Training Step: 1003...  Training loss: 2.0544...  0.1334 sec/batch
Epoch: 4/15...  Training Step: 1004...  Training loss: 2.0559...  0.1311 sec/batch
Epoch: 4/15...  Training Step: 1005...  Training loss: 2.0672...  0.1306 sec/batch
Epoch: 4/15...  Training Step: 1006...  Training loss: 2.0747...  0.1353 sec/batch
Epoch: 4/15...  Training Step: 1007...  Training loss: 2.0930...  0.1795 sec/batch
Epoch: 4/15...  Training Step: 1008...  Training loss: 2.0601...  0.1842 sec/batch
Epoch: 4/15...  Training Step: 1009...  Training loss: 2.0745...  0.1318 sec/batch
Epoch: 4/15...  Training Step: 1010...  Training loss: 2.0874...  0.1341 sec/batch
Epoch: 4/15...  Training Step: 1011...  Training loss: 2.0763...  0.1301 sec/batch
Epoch: 4/15...  Training Step: 1012...  Training loss: 2.0824...  0.1316 sec/batch
Epoch: 4/15...  Training Step: 1013...  Training loss: 2.0614...  0.1294 sec/batch
Epoch: 4/15...  Training Step: 1014...  Training loss: 2.0764...  0.1344 sec/batch
...

Epoch: 4/15...  Training Step: 1103...  Training loss: 2.0675...  0.1342 sec/batch
Epoch: 4/15...  Training Step: 1104...  Training loss: 2.0436...  0.1334 sec/batch
Epoch: 4/15...  Training Step: 1105...  Training loss: 2.0442...  0.1348 sec/batch
Epoch: 4/15...  Training Step: 1106...  Training loss: 2.0756...  0.1304 sec/batch
Epoch: 4/15...  Training Step: 1107...  Training loss: 2.0658...  0.1811 sec/batch
Epoch: 4/15...  Training Step: 1108...  Training loss: 2.0654...  0.1839 sec/batch
Epoch: 4/15...  Training Step: 1109...  Training loss: 2.0735...  0.1345 sec/batch
Epoch: 4/15...  Training Step: 1110...  Training loss: 2.0705...  0.1332 sec/batch
Epoch: 4/15...  Training Step: 1111...  Training loss: 2.0643...  0.1312 sec/batch
Epoch: 4/15...  Training Step: 1112...  Training loss: 2.0827...  0.1344 sec/batch
Epoch: 4/15...  Training Step: 1113...  Training loss: 2.1046...  0.1299 sec/batch
Epoch: 4/15...  Training Step: 1114...  Training loss: 2.1023...  0.1326 sec/batch

Epoch: 4/15...  Training Step: 1203...  Training loss: 2.0956...  0.1327 sec/batch
Epoch: 4/15...  Training Step: 1204...  Training loss: 2.0540...  0.1351 sec/batch
Epoch: 4/15...  Training Step: 1205...  Training loss: 2.0963...  0.1339 sec/batch
Epoch: 4/15...  Training Step: 1206...  Training loss: 2.0451...  0.1311 sec/batch
Epoch: 4/15...  Training Step: 1207...  Training loss: 2.0398...  0.1349 sec/batch
Epoch: 4/15...  Training Step: 1208...  Training loss: 2.0544...  0.1335 sec/batch
Epoch: 4/15...  Training Step: 1209...  Training loss: 2.0546...  0.1311 sec/batch
Epoch: 4/15...  Training Step: 1210...  Training loss: 2.0460...  0.1343 sec/batch
Epoch: 4/15...  Training Step: 1211...  Training loss: 2.0516...  0.1327 sec/batch
Epoch: 4/15...  Training Step: 1212...  Training loss: 2.0458...  0.1341 sec/batch
Epoch: 5/15...  Training Step: 1213...  Training loss: 2.1156...  0.1308 sec/batch
Epoch: 5/15...  Training Step: 1214...  Training loss: 2.0639...  0.1318 sec/batch

Epoch: 5/15...  Training Step: 1303...  Training loss: 2.0051...  0.1340 sec/batch
Epoch: 5/15...  Training Step: 1304...  Training loss: 2.0367...  0.1337 sec/batch
Epoch: 5/15...  Training Step: 1305...  Training loss: 2.0287...  0.1327 sec/batch
Epoch: 5/15...  Training Step: 1306...  Training loss: 2.0113...  0.1350 sec/batch
Epoch: 5/15...  Training Step: 1307...  Training loss: 2.0137...  0.1334 sec/batch
Epoch: 5/15...  Training Step: 1308...  Training loss: 2.0239...  0.1325 sec/batch
Epoch: 5/15...  Training Step: 1309...  Training loss: 2.0340...  0.1328 sec/batch
Epoch: 5/15...  Training Step: 1310...  Training loss: 2.0356...  0.1335 sec/batch
Epoch: 5/15...  Training Step: 1311...  Training loss: 2.0074...  0.1291 sec/batch
Epoch: 5/15...  Training Step: 1312...  Training loss: 2.0183...  0.1332 sec/batch
Epoch: 5/15...  Training Step: 1313...  Training loss: 2.0337...  0.1308 sec/batch
Epoch: 5/15...  Training Step: 1314...  Training loss: 2.0231...  0.1350 sec/batch

Epoch: 5/15...  Training Step: 1403...  Training loss: 1.9955...  0.1343 sec/batch
Epoch: 5/15...  Training Step: 1404...  Training loss: 2.0170...  0.1330 sec/batch
Epoch: 5/15...  Training Step: 1405...  Training loss: 1.9999...  0.1341 sec/batch
Epoch: 5/15...  Training Step: 1406...  Training loss: 2.0255...  0.1332 sec/batch
Epoch: 5/15...  Training Step: 1407...  Training loss: 1.9987...  0.1339 sec/batch
Epoch: 5/15...  Training Step: 1408...  Training loss: 1.9930...  0.1336 sec/batch
Epoch: 5/15...  Training Step: 1409...  Training loss: 2.0328...  0.1323 sec/batch
Epoch: 5/15...  Training Step: 1410...  Training loss: 2.0140...  0.1327 sec/batch
Epoch: 5/15...  Training Step: 1411...  Training loss: 2.0105...  0.1342 sec/batch
Epoch: 5/15...  Training Step: 1412...  Training loss: 2.0255...  0.1334 sec/batch
Epoch: 5/15...  Training Step: 1413...  Training loss: 2.0178...  0.1343 sec/batch
Epoch: 5/15...  Training Step: 1414...  Training loss: 2.0206...  0.1370 sec/batch

Epoch: 5/15...  Training Step: 1503...  Training loss: 2.0257...  0.1322 sec/batch
Epoch: 5/15...  Training Step: 1504...  Training loss: 2.0468...  0.1310 sec/batch
Epoch: 5/15...  Training Step: 1505...  Training loss: 2.0537...  0.1336 sec/batch
Epoch: 5/15...  Training Step: 1506...  Training loss: 2.0527...  0.1337 sec/batch
Epoch: 5/15...  Training Step: 1507...  Training loss: 2.0124...  0.1353 sec/batch
Epoch: 5/15...  Training Step: 1508...  Training loss: 2.0509...  0.1331 sec/batch
Epoch: 5/15...  Training Step: 1509...  Training loss: 2.0067...  0.1351 sec/batch
Epoch: 5/15...  Training Step: 1510...  Training loss: 2.0007...  0.1349 sec/batch
Epoch: 5/15...  Training Step: 1511...  Training loss: 2.0154...  0.1346 sec/batch
Epoch: 5/15...  Training Step: 1512...  Training loss: 2.0198...  0.1327 sec/batch
Epoch: 5/15...  Training Step: 1513...  Training loss: 2.0146...  0.1302 sec/batch
Epoch: 5/15...  Training Step: 1514...  Training loss: 2.0097...  0.1331 sec/batch

Epoch: 6/15...  Training Step: 1603...  Training loss: 1.9957...  0.1335 sec/batch
Epoch: 6/15...  Training Step: 1604...  Training loss: 1.9810...  0.1310 sec/batch
Epoch: 6/15...  Training Step: 1605...  Training loss: 1.9987...  0.1351 sec/batch
Epoch: 6/15...  Training Step: 1606...  Training loss: 1.9810...  0.1332 sec/batch
Epoch: 6/15...  Training Step: 1607...  Training loss: 2.0006...  0.1326 sec/batch
Epoch: 6/15...  Training Step: 1608...  Training loss: 1.9869...  0.1323 sec/batch
Epoch: 6/15...  Training Step: 1609...  Training loss: 1.9715...  0.1351 sec/batch
Epoch: 6/15...  Training Step: 1610...  Training loss: 1.9712...  0.1327 sec/batch
Epoch: 6/15...  Training Step: 1611...  Training loss: 1.9888...  0.1346 sec/batch
Epoch: 6/15...  Training Step: 1612...  Training loss: 2.0014...  0.1314 sec/batch
Epoch: 6/15...  Training Step: 1613...  Training loss: 2.0036...  0.1326 sec/batch
Epoch: 6/15...  Training Step: 1614...  Training loss: 1.9753...  0.1332 sec/batch

Epoch: 6/15...  Training Step: 1703...  Training loss: 1.9735...  0.1329 sec/batch
Epoch: 6/15...  Training Step: 1704...  Training loss: 1.9839...  0.1365 sec/batch
Epoch: 6/15...  Training Step: 1705...  Training loss: 1.9824...  0.1343 sec/batch
Epoch: 6/15...  Training Step: 1706...  Training loss: 1.9617...  0.1328 sec/batch
Epoch: 6/15...  Training Step: 1707...  Training loss: 1.9838...  0.1310 sec/batch
Epoch: 6/15...  Training Step: 1708...  Training loss: 1.9822...  0.1352 sec/batch
Epoch: 6/15...  Training Step: 1709...  Training loss: 1.9937...  0.1310 sec/batch
Epoch: 6/15...  Training Step: 1710...  Training loss: 1.9652...  0.1325 sec/batch
Epoch: 6/15...  Training Step: 1711...  Training loss: 1.9662...  0.1342 sec/batch
Epoch: 6/15...  Training Step: 1712...  Training loss: 2.0051...  0.1334 sec/batch
Epoch: 6/15...  Training Step: 1713...  Training loss: 1.9944...  0.1332 sec/batch
Epoch: 6/15...  Training Step: 1714...  Training loss: 1.9754...  0.1330 sec/batch

Epoch: 6/15...  Training Step: 1803...  Training loss: 1.9619...  0.1294 sec/batch
Epoch: 6/15...  Training Step: 1804...  Training loss: 1.9740...  0.1339 sec/batch
Epoch: 6/15...  Training Step: 1805...  Training loss: 1.9834...  0.1353 sec/batch
Epoch: 6/15...  Training Step: 1806...  Training loss: 1.9876...  0.1314 sec/batch
Epoch: 6/15...  Training Step: 1807...  Training loss: 2.0157...  0.1337 sec/batch
Epoch: 6/15...  Training Step: 1808...  Training loss: 2.0235...  0.1332 sec/batch
Epoch: 6/15...  Training Step: 1809...  Training loss: 2.0249...  0.1313 sec/batch
Epoch: 6/15...  Training Step: 1810...  Training loss: 1.9859...  0.1333 sec/batch
Epoch: 6/15...  Training Step: 1811...  Training loss: 2.0223...  0.1315 sec/batch
Epoch: 6/15...  Training Step: 1812...  Training loss: 1.9810...  0.1330 sec/batch
Epoch: 6/15...  Training Step: 1813...  Training loss: 1.9735...  0.1341 sec/batch
Epoch: 6/15...  Training Step: 1814...  Training loss: 1.9831...  0.1310 sec/batch

Epoch: 7/15...  Training Step: 1903...  Training loss: 1.9793...  0.1346 sec/batch
Epoch: 7/15...  Training Step: 1904...  Training loss: 1.9551...  0.1352 sec/batch
Epoch: 7/15...  Training Step: 1905...  Training loss: 1.9638...  0.1320 sec/batch
Epoch: 7/15...  Training Step: 1906...  Training loss: 1.9636...  0.1348 sec/batch
Epoch: 7/15...  Training Step: 1907...  Training loss: 1.9602...  0.1307 sec/batch
Epoch: 7/15...  Training Step: 1908...  Training loss: 1.9873...  0.1306 sec/batch
Epoch: 7/15...  Training Step: 1909...  Training loss: 1.9518...  0.1350 sec/batch
Epoch: 7/15...  Training Step: 1910...  Training loss: 1.9782...  0.1311 sec/batch
Epoch: 7/15...  Training Step: 1911...  Training loss: 1.9650...  0.1299 sec/batch
Epoch: 7/15...  Training Step: 1912...  Training loss: 1.9464...  0.1350 sec/batch
Epoch: 7/15...  Training Step: 1913...  Training loss: 1.9480...  0.1352 sec/batch
Epoch: 7/15...  Training Step: 1914...  Training loss: 1.9579...  0.1340 sec/batch

Epoch: 7/15...  Training Step: 2003...  Training loss: 1.9634...  0.1345 sec/batch
Epoch: 7/15...  Training Step: 2004...  Training loss: 1.9625...  0.1359 sec/batch
Epoch: 7/15...  Training Step: 2005...  Training loss: 1.9457...  0.1287 sec/batch
Epoch: 7/15...  Training Step: 2006...  Training loss: 1.9470...  0.1334 sec/batch
Epoch: 7/15...  Training Step: 2007...  Training loss: 1.9549...  0.1341 sec/batch
Epoch: 7/15...  Training Step: 2008...  Training loss: 1.9654...  0.1335 sec/batch
Epoch: 7/15...  Training Step: 2009...  Training loss: 1.9431...  0.1301 sec/batch
Epoch: 7/15...  Training Step: 2010...  Training loss: 1.9546...  0.1328 sec/batch
Epoch: 7/15...  Training Step: 2011...  Training loss: 1.9504...  0.1344 sec/batch
Epoch: 7/15...  Training Step: 2012...  Training loss: 1.9648...  0.1317 sec/batch
Epoch: 7/15...  Training Step: 2013...  Training loss: 1.9397...  0.1307 sec/batch
Epoch: 7/15...  Training Step: 2014...  Training loss: 1.9368...  0.1336 sec/batch

Epoch: 7/15...  Training Step: 2103...  Training loss: 1.9217...  0.1294 sec/batch
Epoch: 7/15...  Training Step: 2104...  Training loss: 1.9490...  0.1340 sec/batch
Epoch: 7/15...  Training Step: 2105...  Training loss: 1.9731...  0.1304 sec/batch
Epoch: 7/15...  Training Step: 2106...  Training loss: 1.9440...  0.1356 sec/batch
Epoch: 7/15...  Training Step: 2107...  Training loss: 1.9531...  0.1348 sec/batch
Epoch: 7/15...  Training Step: 2108...  Training loss: 1.9746...  0.1343 sec/batch
Epoch: 7/15...  Training Step: 2109...  Training loss: 1.9710...  0.1350 sec/batch
Epoch: 7/15...  Training Step: 2110...  Training loss: 2.0021...  0.1353 sec/batch
Epoch: 7/15...  Training Step: 2111...  Training loss: 1.9935...  0.1315 sec/batch
Epoch: 7/15...  Training Step: 2112...  Training loss: 2.0094...  0.1354 sec/batch
Epoch: 7/15...  Training Step: 2113...  Training loss: 1.9631...  0.1311 sec/batch
Epoch: 7/15...  Training Step: 2114...  Training loss: 1.9985...  0.1334 sec/batch

Epoch: 8/15...  Training Step: 2203...  Training loss: 1.9758...  0.1315 sec/batch
Epoch: 8/15...  Training Step: 2204...  Training loss: 1.9476...  0.1338 sec/batch
Epoch: 8/15...  Training Step: 2205...  Training loss: 1.9457...  0.1339 sec/batch
Epoch: 8/15...  Training Step: 2206...  Training loss: 1.9558...  0.1329 sec/batch
Epoch: 8/15...  Training Step: 2207...  Training loss: 1.9355...  0.1359 sec/batch
Epoch: 8/15...  Training Step: 2208...  Training loss: 1.9451...  0.1346 sec/batch
Epoch: 8/15...  Training Step: 2209...  Training loss: 1.9571...  0.1339 sec/batch
Epoch: 8/15...  Training Step: 2210...  Training loss: 1.9430...  0.1339 sec/batch
Epoch: 8/15...  Training Step: 2211...  Training loss: 1.9593...  0.1330 sec/batch
Epoch: 8/15...  Training Step: 2212...  Training loss: 1.9286...  0.1329 sec/batch
Epoch: 8/15...  Training Step: 2213...  Training loss: 1.9641...  0.1339 sec/batch
Epoch: 8/15...  Training Step: 2214...  Training loss: 1.9528...  0.1302 sec/batch

Epoch: 8/15...  Training Step: 2303...  Training loss: 1.9503...  0.1337 sec/batch
Epoch: 8/15...  Training Step: 2304...  Training loss: 1.9389...  0.1333 sec/batch
Epoch: 8/15...  Training Step: 2305...  Training loss: 1.9363...  0.1319 sec/batch
Epoch: 8/15...  Training Step: 2306...  Training loss: 1.9495...  0.1339 sec/batch
Epoch: 8/15...  Training Step: 2307...  Training loss: 1.9460...  0.1366 sec/batch
Epoch: 8/15...  Training Step: 2308...  Training loss: 1.9310...  0.1307 sec/batch
Epoch: 8/15...  Training Step: 2309...  Training loss: 1.9331...  0.1354 sec/batch
Epoch: 8/15...  Training Step: 2310...  Training loss: 1.9354...  0.1343 sec/batch
Epoch: 8/15...  Training Step: 2311...  Training loss: 1.9423...  0.1334 sec/batch
Epoch: 8/15...  Training Step: 2312...  Training loss: 1.9230...  0.1355 sec/batch
Epoch: 8/15...  Training Step: 2313...  Training loss: 1.9307...  0.1329 sec/batch
Epoch: 8/15...  Training Step: 2314...  Training loss: 1.9236...  0.1360 sec/batch

Epoch: 8/15...  Training Step: 2403...  Training loss: 1.9314...  0.1326 sec/batch
Epoch: 8/15...  Training Step: 2404...  Training loss: 1.9384...  0.1361 sec/batch
Epoch: 8/15...  Training Step: 2405...  Training loss: 1.9272...  0.1354 sec/batch
Epoch: 8/15...  Training Step: 2406...  Training loss: 1.9058...  0.1331 sec/batch
Epoch: 8/15...  Training Step: 2407...  Training loss: 1.9306...  0.1334 sec/batch
Epoch: 8/15...  Training Step: 2408...  Training loss: 1.9579...  0.1336 sec/batch
Epoch: 8/15...  Training Step: 2409...  Training loss: 1.9249...  0.1364 sec/batch
Epoch: 8/15...  Training Step: 2410...  Training loss: 1.9293...  0.1294 sec/batch
Epoch: 8/15...  Training Step: 2411...  Training loss: 1.9458...  0.1348 sec/batch
Epoch: 8/15...  Training Step: 2412...  Training loss: 1.9614...  0.1310 sec/batch
Epoch: 8/15...  Training Step: 2413...  Training loss: 1.9801...  0.1337 sec/batch
Epoch: 8/15...  Training Step: 2414...  Training loss: 1.9780...  0.1327 sec/batch

Epoch: 9/15...  Training Step: 2503...  Training loss: 1.9367...  0.1309 sec/batch
Epoch: 9/15...  Training Step: 2504...  Training loss: 1.9315...  0.1326 sec/batch
Epoch: 9/15...  Training Step: 2505...  Training loss: 1.9571...  0.1328 sec/batch
Epoch: 9/15...  Training Step: 2506...  Training loss: 1.9591...  0.1309 sec/batch
Epoch: 9/15...  Training Step: 2507...  Training loss: 1.9302...  0.1306 sec/batch
Epoch: 9/15...  Training Step: 2508...  Training loss: 1.9274...  0.1361 sec/batch
Epoch: 9/15...  Training Step: 2509...  Training loss: 1.9425...  0.1352 sec/batch
Epoch: 9/15...  Training Step: 2510...  Training loss: 1.9243...  0.1324 sec/batch
Epoch: 9/15...  Training Step: 2511...  Training loss: 1.9194...  0.1342 sec/batch
Epoch: 9/15...  Training Step: 2512...  Training loss: 1.9307...  0.1348 sec/batch
Epoch: 9/15...  Training Step: 2513...  Training loss: 1.9292...  0.1332 sec/batch
Epoch: 9/15...  Training Step: 2514...  Training loss: 1.9362...  0.1347 sec/batch

Epoch: 9/15...  Training Step: 2603...  Training loss: 1.9138...  0.1331 sec/batch
Epoch: 9/15...  Training Step: 2604...  Training loss: 1.9094...  0.1335 sec/batch
Epoch: 9/15...  Training Step: 2605...  Training loss: 1.9112...  0.1336 sec/batch
Epoch: 9/15...  Training Step: 2606...  Training loss: 1.9372...  0.1341 sec/batch
Epoch: 9/15...  Training Step: 2607...  Training loss: 1.9296...  0.1309 sec/batch
Epoch: 9/15...  Training Step: 2608...  Training loss: 1.9139...  0.1315 sec/batch
Epoch: 9/15...  Training Step: 2609...  Training loss: 1.9284...  0.1325 sec/batch
Epoch: 9/15...  Training Step: 2610...  Training loss: 1.9219...  0.1313 sec/batch
Epoch: 9/15...  Training Step: 2611...  Training loss: 1.9166...  0.1337 sec/batch
Epoch: 9/15...  Training Step: 2612...  Training loss: 1.9131...  0.1348 sec/batch
Epoch: 9/15...  Training Step: 2613...  Training loss: 1.9163...  0.1879 sec/batch
Epoch: 9/15...  Training Step: 2614...  Training loss: 1.9196...  0.1771 sec/batch

Epoch: 9/15...  Training Step: 2703...  Training loss: 1.8973...  0.1319 sec/batch
Epoch: 9/15...  Training Step: 2704...  Training loss: 1.9093...  0.1339 sec/batch
Epoch: 9/15...  Training Step: 2705...  Training loss: 1.9017...  0.1357 sec/batch
Epoch: 9/15...  Training Step: 2706...  Training loss: 1.9175...  0.1306 sec/batch
Epoch: 9/15...  Training Step: 2707...  Training loss: 1.9229...  0.1323 sec/batch
Epoch: 9/15...  Training Step: 2708...  Training loss: 1.9121...  0.1334 sec/batch
Epoch: 9/15...  Training Step: 2709...  Training loss: 1.8911...  0.1354 sec/batch
Epoch: 9/15...  Training Step: 2710...  Training loss: 1.9196...  0.1306 sec/batch
Epoch: 9/15...  Training Step: 2711...  Training loss: 1.9517...  0.1334 sec/batch
Epoch: 9/15...  Training Step: 2712...  Training loss: 1.9104...  0.1349 sec/batch
Epoch: 9/15...  Training Step: 2713...  Training loss: 1.9135...  0.1347 sec/batch
Epoch: 9/15...  Training Step: 2714...  Training loss: 1.9330...  0.1345 sec/batch

Epoch: 10/15...  Training Step: 2801...  Training loss: 1.9115...  0.1354 sec/batch
Epoch: 10/15...  Training Step: 2802...  Training loss: 1.8999...  0.1334 sec/batch
Epoch: 10/15...  Training Step: 2803...  Training loss: 1.9214...  0.1330 sec/batch
Epoch: 10/15...  Training Step: 2804...  Training loss: 1.9407...  0.1309 sec/batch
Epoch: 10/15...  Training Step: 2805...  Training loss: 1.9423...  0.1313 sec/batch
Epoch: 10/15...  Training Step: 2806...  Training loss: 1.9291...  0.1341 sec/batch
Epoch: 10/15...  Training Step: 2807...  Training loss: 1.9167...  0.1348 sec/batch
Epoch: 10/15...  Training Step: 2808...  Training loss: 1.9503...  0.1347 sec/batch
Epoch: 10/15...  Training Step: 2809...  Training loss: 1.9448...  0.1359 sec/batch
Epoch: 10/15...  Training Step: 2810...  Training loss: 1.9182...  0.1332 sec/batch
Epoch: 10/15...  Training Step: 2811...  Training loss: 1.9238...  0.1333 sec/batch
Epoch: 10/15...  Training Step: 2812...  Training loss: 1.9230...  0.1343 sec/batch

Epoch: 10/15...  Training Step: 2899...  Training loss: 1.9168...  0.1305 sec/batch
Epoch: 10/15...  Training Step: 2900...  Training loss: 1.9129...  0.1326 sec/batch
Epoch: 10/15...  Training Step: 2901...  Training loss: 1.9097...  0.1330 sec/batch
Epoch: 10/15...  Training Step: 2902...  Training loss: 1.9051...  0.1334 sec/batch
Epoch: 10/15...  Training Step: 2903...  Training loss: 1.8869...  0.1320 sec/batch
Epoch: 10/15...  Training Step: 2904...  Training loss: 1.9136...  0.1302 sec/batch
Epoch: 10/15...  Training Step: 2905...  Training loss: 1.9048...  0.1343 sec/batch
Epoch: 10/15...  Training Step: 2906...  Training loss: 1.8955...  0.1337 sec/batch
Epoch: 10/15...  Training Step: 2907...  Training loss: 1.8992...  0.1302 sec/batch
Epoch: 10/15...  Training Step: 2908...  Training loss: 1.9081...  0.1332 sec/batch
Epoch: 10/15...  Training Step: 2909...  Training loss: 1.9240...  0.1363 sec/batch
Epoch: 10/15...  Training Step: 2910...  Training loss: 1.9028...  0.1343 sec/batch

Epoch: 10/15...  Training Step: 2997...  Training loss: 1.9133...  0.1340 sec/batch
Epoch: 10/15...  Training Step: 2998...  Training loss: 1.8834...  0.1338 sec/batch
Epoch: 10/15...  Training Step: 2999...  Training loss: 1.8869...  0.1325 sec/batch
Epoch: 10/15...  Training Step: 3000...  Training loss: 1.9093...  0.1346 sec/batch
Epoch: 10/15...  Training Step: 3001...  Training loss: 1.8534...  0.1332 sec/batch
Epoch: 10/15...  Training Step: 3002...  Training loss: 1.8952...  0.1324 sec/batch
Epoch: 10/15...  Training Step: 3003...  Training loss: 1.9022...  0.1331 sec/batch
Epoch: 10/15...  Training Step: 3004...  Training loss: 1.8783...  0.1341 sec/batch
Epoch: 10/15...  Training Step: 3005...  Training loss: 1.8649...  0.1345 sec/batch
Epoch: 10/15...  Training Step: 3006...  Training loss: 1.8765...  0.1325 sec/batch
Epoch: 10/15...  Training Step: 3007...  Training loss: 1.8934...  0.1369 sec/batch
Epoch: 10/15...  Training Step: 3008...  Training loss: 1.8945...  0.1343 sec/batch

Epoch: 11/15...  Training Step: 3095...  Training loss: 1.9169...  0.1352 sec/batch
Epoch: 11/15...  Training Step: 3096...  Training loss: 1.9434...  0.1338 sec/batch
Epoch: 11/15...  Training Step: 3097...  Training loss: 1.9200...  0.1364 sec/batch
Epoch: 11/15...  Training Step: 3098...  Training loss: 1.8814...  0.1346 sec/batch
Epoch: 11/15...  Training Step: 3099...  Training loss: 1.8933...  0.1328 sec/batch
Epoch: 11/15...  Training Step: 3100...  Training loss: 1.8868...  0.1312 sec/batch
Epoch: 11/15...  Training Step: 3101...  Training loss: 1.9167...  0.1316 sec/batch
Epoch: 11/15...  Training Step: 3102...  Training loss: 1.9025...  0.1338 sec/batch
Epoch: 11/15...  Training Step: 3103...  Training loss: 1.9041...  0.1329 sec/batch
Epoch: 11/15...  Training Step: 3104...  Training loss: 1.8982...  0.1349 sec/batch
Epoch: 11/15...  Training Step: 3105...  Training loss: 1.8797...  0.1325 sec/batch
Epoch: 11/15...  Training Step: 3106...  Training loss: 1.9085...  0.1328 sec/batch

Epoch: 11/15...  Training Step: 3193...  Training loss: 1.8982...  0.1350 sec/batch
Epoch: 11/15...  Training Step: 3194...  Training loss: 1.9064...  0.1344 sec/batch
Epoch: 11/15...  Training Step: 3195...  Training loss: 1.8901...  0.1328 sec/batch
Epoch: 11/15...  Training Step: 3196...  Training loss: 1.9313...  0.1331 sec/batch
Epoch: 11/15...  Training Step: 3197...  Training loss: 1.8947...  0.1306 sec/batch
Epoch: 11/15...  Training Step: 3198...  Training loss: 1.9117...  0.1347 sec/batch
Epoch: 11/15...  Training Step: 3199...  Training loss: 1.9133...  0.1345 sec/batch
Epoch: 11/15...  Training Step: 3200...  Training loss: 1.9225...  0.1356 sec/batch
Epoch: 11/15...  Training Step: 3201...  Training loss: 1.8998...  0.1359 sec/batch
Epoch: 11/15...  Training Step: 3202...  Training loss: 1.8989...  0.1350 sec/batch
Epoch: 11/15...  Training Step: 3203...  Training loss: 1.9096...  0.1351 sec/batch
Epoch: 11/15...  Training Step: 3204...  Training loss: 1.8955...  0.1346 sec/batch

Epoch: 11/15...  Training Step: 3291...  Training loss: 1.9131...  0.1351 sec/batch
Epoch: 11/15...  Training Step: 3292...  Training loss: 1.8635...  0.1353 sec/batch
Epoch: 11/15...  Training Step: 3293...  Training loss: 1.8910...  0.1324 sec/batch
Epoch: 11/15...  Training Step: 3294...  Training loss: 1.8781...  0.1309 sec/batch
Epoch: 11/15...  Training Step: 3295...  Training loss: 1.8816...  0.1333 sec/batch
Epoch: 11/15...  Training Step: 3296...  Training loss: 1.8910...  0.1342 sec/batch
Epoch: 11/15...  Training Step: 3297...  Training loss: 1.8882...  0.1342 sec/batch
Epoch: 11/15...  Training Step: 3298...  Training loss: 1.8878...  0.1312 sec/batch
Epoch: 11/15...  Training Step: 3299...  Training loss: 1.8976...  0.1318 sec/batch
Epoch: 11/15...  Training Step: 3300...  Training loss: 1.9128...  0.1326 sec/batch
Epoch: 11/15...  Training Step: 3301...  Training loss: 1.8731...  0.1333 sec/batch
Epoch: 11/15...  Training Step: 3302...  Training loss: 1.8745...  0.1338 sec/batch

Epoch: 12/15...  Training Step: 3389...  Training loss: 1.8911...  0.1348 sec/batch
Epoch: 12/15...  Training Step: 3390...  Training loss: 1.8818...  0.1345 sec/batch
Epoch: 12/15...  Training Step: 3391...  Training loss: 1.8929...  0.1342 sec/batch
Epoch: 12/15...  Training Step: 3392...  Training loss: 1.8748...  0.1334 sec/batch
Epoch: 12/15...  Training Step: 3393...  Training loss: 1.8651...  0.1319 sec/batch
Epoch: 12/15...  Training Step: 3394...  Training loss: 1.8708...  0.1355 sec/batch
Epoch: 12/15...  Training Step: 3395...  Training loss: 1.9035...  0.1309 sec/batch
Epoch: 12/15...  Training Step: 3396...  Training loss: 1.8784...  0.1338 sec/batch
Epoch: 12/15...  Training Step: 3397...  Training loss: 1.8990...  0.1314 sec/batch
Epoch: 12/15...  Training Step: 3398...  Training loss: 1.9131...  0.1304 sec/batch
Epoch: 12/15...  Training Step: 3399...  Training loss: 1.9327...  0.1332 sec/batch
Epoch: 12/15...  Training Step: 3400...  Training loss: 1.9113...  0.1339 sec/batch

Epoch: 12/15...  Training Step: 3487...  Training loss: 1.9129...  0.1317 sec/batch
Epoch: 12/15...  Training Step: 3488...  Training loss: 1.8894...  0.1344 sec/batch
Epoch: 12/15...  Training Step: 3489...  Training loss: 1.9030...  0.1353 sec/batch
Epoch: 12/15...  Training Step: 3490...  Training loss: 1.9068...  0.1343 sec/batch
Epoch: 12/15...  Training Step: 3491...  Training loss: 1.8992...  0.1325 sec/batch
Epoch: 12/15...  Training Step: 3492...  Training loss: 1.9089...  0.1336 sec/batch
Epoch: 12/15...  Training Step: 3493...  Training loss: 1.9032...  0.1325 sec/batch
Epoch: 12/15...  Training Step: 3494...  Training loss: 1.8685...  0.1352 sec/batch
Epoch: 12/15...  Training Step: 3495...  Training loss: 1.8827...  0.1314 sec/batch
Epoch: 12/15...  Training Step: 3496...  Training loss: 1.8890...  0.1340 sec/batch
Epoch: 12/15...  Training Step: 3497...  Training loss: 1.8950...  0.1353 sec/batch
Epoch: 12/15...  Training Step: 3498...  Training loss: 1.8812...  0.1321 sec/batch

Epoch: 12/15...  Training Step: 3585...  Training loss: 1.8931...  0.1309 sec/batch
Epoch: 12/15...  Training Step: 3586...  Training loss: 1.8734...  0.1309 sec/batch
Epoch: 12/15...  Training Step: 3587...  Training loss: 1.8760...  0.1355 sec/batch
Epoch: 12/15...  Training Step: 3588...  Training loss: 1.8661...  0.1330 sec/batch
Epoch: 12/15...  Training Step: 3589...  Training loss: 1.8752...  0.1354 sec/batch
Epoch: 12/15...  Training Step: 3590...  Training loss: 1.9052...  0.1321 sec/batch
Epoch: 12/15...  Training Step: 3591...  Training loss: 1.9282...  0.1320 sec/batch
Epoch: 12/15...  Training Step: 3592...  Training loss: 1.8709...  0.1347 sec/batch
Epoch: 12/15...  Training Step: 3593...  Training loss: 1.8911...  0.1352 sec/batch
Epoch: 12/15...  Training Step: 3594...  Training loss: 1.9014...  0.1330 sec/batch
Epoch: 12/15...  Training Step: 3595...  Training loss: 1.8526...  0.1352 sec/batch
Epoch: 12/15...  Training Step: 3596...  Training loss: 1.8845...  0.1320 sec/batch

Epoch: 13/15...  Training Step: 3683...  Training loss: 1.8702...  0.1336 sec/batch
Epoch: 13/15...  Training Step: 3684...  Training loss: 1.8810...  0.1342 sec/batch
Epoch: 13/15...  Training Step: 3685...  Training loss: 1.8702...  0.1326 sec/batch
Epoch: 13/15...  Training Step: 3686...  Training loss: 1.8578...  0.1340 sec/batch
Epoch: 13/15...  Training Step: 3687...  Training loss: 1.8683...  0.1348 sec/batch
Epoch: 13/15...  Training Step: 3688...  Training loss: 1.8838...  0.1349 sec/batch
Epoch: 13/15...  Training Step: 3689...  Training loss: 1.8804...  0.1311 sec/batch
Epoch: 13/15...  Training Step: 3690...  Training loss: 1.8746...  0.1344 sec/batch
Epoch: 13/15...  Training Step: 3691...  Training loss: 1.8773...  0.1341 sec/batch
Epoch: 13/15...  Training Step: 3692...  Training loss: 1.8812...  0.1343 sec/batch
Epoch: 13/15...  Training Step: 3693...  Training loss: 1.8789...  0.1324 sec/batch
Epoch: 13/15...  Training Step: 3694...  Training loss: 1.8862...  0.1330 sec/batch

Epoch: 13/15...  Training Step: 3781...  Training loss: 1.8631...  0.1332 sec/batch
Epoch: 13/15...  Training Step: 3782...  Training loss: 1.8980...  0.1332 sec/batch
Epoch: 13/15...  Training Step: 3783...  Training loss: 1.8898...  0.1360 sec/batch
Epoch: 13/15...  Training Step: 3784...  Training loss: 1.9069...  0.1319 sec/batch
Epoch: 13/15...  Training Step: 3785...  Training loss: 1.8952...  0.1350 sec/batch
Epoch: 13/15...  Training Step: 3786...  Training loss: 1.8735...  0.1367 sec/batch
Epoch: 13/15...  Training Step: 3787...  Training loss: 1.8995...  0.1326 sec/batch
Epoch: 13/15...  Training Step: 3788...  Training loss: 1.8955...  0.1357 sec/batch
Epoch: 13/15...  Training Step: 3789...  Training loss: 1.8892...  0.1315 sec/batch
Epoch: 13/15...  Training Step: 3790...  Training loss: 1.8978...  0.1350 sec/batch
Epoch: 13/15...  Training Step: 3791...  Training loss: 1.8843...  0.1355 sec/batch
Epoch: 13/15...  Training Step: 3792...  Training loss: 1.8956...  0.1351 sec/batch

Epoch: 13/15...  Training Step: 3879...  Training loss: 1.8687...  0.1299 sec/batch
Epoch: 13/15...  Training Step: 3880...  Training loss: 1.8972...  0.1362 sec/batch
Epoch: 13/15...  Training Step: 3881...  Training loss: 1.8726...  0.1347 sec/batch
Epoch: 13/15...  Training Step: 3882...  Training loss: 1.8593...  0.1308 sec/batch
Epoch: 13/15...  Training Step: 3883...  Training loss: 1.9028...  0.1338 sec/batch
Epoch: 13/15...  Training Step: 3884...  Training loss: 1.8573...  0.1340 sec/batch
Epoch: 13/15...  Training Step: 3885...  Training loss: 1.8878...  0.1324 sec/batch
Epoch: 13/15...  Training Step: 3886...  Training loss: 1.8931...  0.1343 sec/batch
Epoch: 13/15...  Training Step: 3887...  Training loss: 1.8648...  0.1344 sec/batch
Epoch: 13/15...  Training Step: 3888...  Training loss: 1.8813...  0.1335 sec/batch
Epoch: 13/15...  Training Step: 3889...  Training loss: 1.8760...  0.1337 sec/batch
Epoch: 13/15...  Training Step: 3890...  Training loss: 1.8539...  0.1314 sec/batch

Epoch: 14/15...  Training Step: 3977...  Training loss: 1.8675...  0.1316 sec/batch
Epoch: 14/15...  Training Step: 3978...  Training loss: 1.8818...  0.1338 sec/batch
Epoch: 14/15...  Training Step: 3979...  Training loss: 1.8924...  0.1321 sec/batch
Epoch: 14/15...  Training Step: 3980...  Training loss: 1.8852...  0.1338 sec/batch
Epoch: 14/15...  Training Step: 3981...  Training loss: 1.8773...  0.1339 sec/batch
Epoch: 14/15...  Training Step: 3982...  Training loss: 1.8937...  0.1314 sec/batch
Epoch: 14/15...  Training Step: 3983...  Training loss: 1.8610...  0.1342 sec/batch
Epoch: 14/15...  Training Step: 3984...  Training loss: 1.8828...  0.1339 sec/batch
Epoch: 14/15...  Training Step: 3985...  Training loss: 1.8697...  0.1333 sec/batch
Epoch: 14/15...  Training Step: 3986...  Training loss: 1.8646...  0.1320 sec/batch
Epoch: 14/15...  Training Step: 3987...  Training loss: 1.8736...  0.1341 sec/batch
Epoch: 14/15...  Training Step: 3988...  Training loss: 1.8575...  0.1328 sec/batch

Epoch: 14/15...  Training Step: 4075...  Training loss: 1.8726...  0.1364 sec/batch
Epoch: 14/15...  Training Step: 4076...  Training loss: 1.9153...  0.1329 sec/batch
Epoch: 14/15...  Training Step: 4077...  Training loss: 1.8882...  0.1354 sec/batch
Epoch: 14/15...  Training Step: 4078...  Training loss: 1.8915...  0.1365 sec/batch
Epoch: 14/15...  Training Step: 4079...  Training loss: 1.8797...  0.1375 sec/batch
Epoch: 14/15...  Training Step: 4080...  Training loss: 1.8593...  0.1333 sec/batch
Epoch: 14/15...  Training Step: 4081...  Training loss: 1.8489...  0.1345 sec/batch
Epoch: 14/15...  Training Step: 4082...  Training loss: 1.8591...  0.1372 sec/batch
Epoch: 14/15...  Training Step: 4083...  Training loss: 1.8755...  0.1331 sec/batch
Epoch: 14/15...  Training Step: 4084...  Training loss: 1.8533...  0.1343 sec/batch
Epoch: 14/15...  Training Step: 4085...  Training loss: 1.8949...  0.1353 sec/batch
Epoch: 14/15...  Training Step: 4086...  Training loss: 1.8865...  0.1341 sec/batch

Epoch: 14/15...  Training Step: 4173...  Training loss: 1.8951...  0.1319 sec/batch
Epoch: 14/15...  Training Step: 4174...  Training loss: 1.8731...  0.1344 sec/batch
Epoch: 14/15...  Training Step: 4175...  Training loss: 1.8718...  0.1347 sec/batch
Epoch: 14/15...  Training Step: 4176...  Training loss: 1.8927...  0.1344 sec/batch
Epoch: 14/15...  Training Step: 4177...  Training loss: 1.8897...  0.1356 sec/batch
Epoch: 14/15...  Training Step: 4178...  Training loss: 1.8802...  0.1351 sec/batch
Epoch: 14/15...  Training Step: 4179...  Training loss: 1.8926...  0.1320 sec/batch
Epoch: 14/15...  Training Step: 4180...  Training loss: 1.8935...  0.1324 sec/batch
Epoch: 14/15...  Training Step: 4181...  Training loss: 1.8750...  0.1340 sec/batch
Epoch: 14/15...  Training Step: 4182...  Training loss: 1.8564...  0.1321 sec/batch
Epoch: 14/15...  Training Step: 4183...  Training loss: 1.8848...  0.1335 sec/batch
Epoch: 14/15...  Training Step: 4184...  Training loss: 1.8691...  0.1307 sec/batch

Epoch: 15/15...  Training Step: 4271...  Training loss: 1.9118...  0.1326 sec/batch
Epoch: 15/15...  Training Step: 4272...  Training loss: 1.8797...  0.1321 sec/batch
Epoch: 15/15...  Training Step: 4273...  Training loss: 1.8890...  0.1323 sec/batch
Epoch: 15/15...  Training Step: 4274...  Training loss: 1.8863...  0.1322 sec/batch
Epoch: 15/15...  Training Step: 4275...  Training loss: 1.8797...  0.1333 sec/batch
Epoch: 15/15...  Training Step: 4276...  Training loss: 1.8830...  0.1337 sec/batch
Epoch: 15/15...  Training Step: 4277...  Training loss: 1.8719...  0.1311 sec/batch
Epoch: 15/15...  Training Step: 4278...  Training loss: 1.8611...  0.1346 sec/batch
Epoch: 15/15...  Training Step: 4279...  Training loss: 1.8888...  0.1333 sec/batch
Epoch: 15/15...  Training Step: 4280...  Training loss: 1.8564...  0.1353 sec/batch
Epoch: 15/15...  Training Step: 4281...  Training loss: 1.8779...  0.1356 sec/batch
Epoch: 15/15...  Training Step: 4282...  Training loss: 1.8784...  0.1345 se

Epoch: 15/15...  Training Step: 4369...  Training loss: 1.8830...  0.1303 sec/batch
Epoch: 15/15...  Training Step: 4370...  Training loss: 1.9132...  0.1330 sec/batch
Epoch: 15/15...  Training Step: 4371...  Training loss: 1.8883...  0.1335 sec/batch
Epoch: 15/15...  Training Step: 4372...  Training loss: 1.9092...  0.1311 sec/batch
Epoch: 15/15...  Training Step: 4373...  Training loss: 1.8948...  0.1367 sec/batch
Epoch: 15/15...  Training Step: 4374...  Training loss: 1.8918...  0.1323 sec/batch
Epoch: 15/15...  Training Step: 4375...  Training loss: 1.8896...  0.1359 sec/batch
Epoch: 15/15...  Training Step: 4376...  Training loss: 1.9138...  0.1358 sec/batch
Epoch: 15/15...  Training Step: 4377...  Training loss: 1.8921...  0.1322 sec/batch
Epoch: 15/15...  Training Step: 4378...  Training loss: 1.8591...  0.1320 sec/batch
Epoch: 15/15...  Training Step: 4379...  Training loss: 1.9070...  0.1312 sec/batch
Epoch: 15/15...  Training Step: 4380...  Training loss: 1.8769...  0.1350 se

Epoch: 15/15...  Training Step: 4467...  Training loss: 1.8740...  0.1302 sec/batch
Epoch: 15/15...  Training Step: 4468...  Training loss: 1.8872...  0.1349 sec/batch
Epoch: 15/15...  Training Step: 4469...  Training loss: 1.8764...  0.1302 sec/batch
Epoch: 15/15...  Training Step: 4470...  Training loss: 1.8720...  0.1344 sec/batch
Epoch: 15/15...  Training Step: 4471...  Training loss: 1.8923...  0.1359 sec/batch
Epoch: 15/15...  Training Step: 4472...  Training loss: 1.8982...  0.1339 sec/batch
Epoch: 15/15...  Training Step: 4473...  Training loss: 1.8585...  0.1331 sec/batch
Epoch: 15/15...  Training Step: 4474...  Training loss: 1.8733...  0.1341 sec/batch
Epoch: 15/15...  Training Step: 4475...  Training loss: 1.8654...  0.1346 sec/batch
Epoch: 15/15...  Training Step: 4476...  Training loss: 1.8868...  0.1339 sec/batch
Epoch: 15/15...  Training Step: 4477...  Training loss: 1.8674...  0.1319 sec/batch
Epoch: 15/15...  Training Step: 4478...  Training loss: 1.8645...  0.1353 se
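
To put the final loss values in perspective, they can be converted into a per-character perplexity. This is only a rough sketch: it assumes the logged training loss is the mean softmax cross-entropy per character, measured in nats, which the log itself does not state. The five values are copied from the last logged steps above (4474 to 4478).

```python
import math

# Training losses copied from the last five logged steps (4474 to 4478).
final_losses = [1.8733, 1.8654, 1.8868, 1.8674, 1.8645]

# Average loss over those steps.
mean_loss = sum(final_losses) / len(final_losses)

# ASSUMPTION: the loss is a mean cross-entropy in nats, so exp(loss) is the
# per-character perplexity and loss / ln(2) is bits per character.
perplexity = math.exp(mean_loss)
bits_per_char = mean_loss / math.log(2)

print("mean loss: {:.4f}".format(mean_loss))
print("perplexity: {:.2f}".format(perplexity))
print("bits per character: {:.2f}".format(bits_per_char))
```

Under that assumption, a loss near 1.87 means the network is about as uncertain as if it were choosing uniformly among roughly 6.5 characters at each step. This is a training figure, so it says nothing about how well the model generalizes.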

Training saves the model to checkpoint files in the checkpoints directory. The call below returns the checkpoint state recorded there.

tf.train.get_checkpoint_state('checkpoints')

### Sampling

Now that the network is trained, we can use it to generate new text. The idea is that we pass in a character and the network predicts the next character. We then feed that new character back in to predict the one after it, and keep doing this to generate entirely new text. I also included some functionality to prime the network with some text by passing in a string and building up a state from that.

The network gives us predictions for each character. To reduce noise and make things a little less random, I only choose a new character from the top N most likely characters.

In [21]:
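def pick_top_n(preds, vocab_size, top_n=5):
    # Drop singleton dimensions so p is a 1-D probability vector
    p = np.squeeze(preds)
    # Zero out everything except the top_n most likely characters
    p[np.argsort(p)[:-top_n]] = 0
    # Renormalize so the remaining probabilities sum to 1
    p = p / np.sum(p)
    # Draw one character index from the truncated distribution
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c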
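
To see what the top-N filter does to a distribution, here is a small standalone sketch on a made-up six-character prediction. The helper name `top_n_distribution` and the numbers are illustrative assumptions, not part of the project code. Unlike `pick_top_n`, it works on a copy, so the caller's array is left untouched.

```python
import numpy as np

def top_n_distribution(preds, top_n=5):
    """Return a renormalized copy of preds that keeps only the top_n entries."""
    # np.array copies by default, so the caller's array is not modified.
    p = np.array(preds, dtype=np.float64).squeeze()
    # argsort is ascending, so everything except the last top_n indices is zeroed.
    p[np.argsort(p)[:-top_n]] = 0
    return p / p.sum()

# Made-up prediction over a hypothetical 6-character vocabulary.
preds = np.array([[0.05, 0.40, 0.10, 0.25, 0.15, 0.05]])
filtered = top_n_distribution(preds, top_n=3)
print(filtered)

# Sample one character index from the filtered distribution.
rng = np.random.default_rng(0)
c = rng.choice(len(filtered), p=filtered)
print(c)
```

Only indices 1, 3 and 4 keep a non-zero probability (0.40, 0.25 and 0.15, rescaled to sum to 1), so the sampled index can only be one of those three.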

In [22]:
def sample(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    # Start the output with the priming text
    samples = [c for c in prime]
    # Build the network in sampling mode (inputs are one character, shape (1, 1))
    model = CharRNN(vocab_size, lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # Load the trained weights from the checkpoint
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)
        # Feed the priming text one character at a time to build up the state
        for c in prime:
            x = np.zeros((1, 1))
            x[0, 0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state],
                                        feed_dict=feed)

        # First generated character, drawn from the prediction after the prime
        c = pick_top_n(preds, vocab_size)
        samples.append(int_to_vocab[c])

        # Feed each sampled character back in to predict the next one
        for i in range(n_samples):
            x[0, 0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state],
                                        feed_dict=feed)

            c = pick_top_n(preds, vocab_size)
            samples.append(int_to_vocab[c])

    return ''.join(samples)
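
The priming and generation loops can be tried without TensorFlow by swapping the trained network for a stand-in. The sketch below is an illustration only: `toy_step`, `toy_sample`, the four-character vocabulary and the random transition table are hypothetical and not part of the project. The point is how the state and the latest prediction are threaded through the two loops.

```python
import numpy as np

# Hypothetical 4-character vocabulary and lookup tables.
vocab = sorted(set("abc "))
vocab_to_int = {ch: i for i, ch in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))

# Stand-in for the trained network: a fixed table of next-character
# probabilities, one row per current character.
rng = np.random.default_rng(0)
table = rng.random((len(vocab), len(vocab)))
table /= table.sum(axis=1, keepdims=True)

def toy_step(char_id, state):
    """Return (next-character probabilities, new state).

    The "state" here is just a step counter standing in for the LSTM state.
    """
    return table[char_id], state + 1

def toy_sample(n_samples, prime="ab"):
    samples = list(prime)
    state = 0
    # Priming: run the prime text through the model to build up the state.
    for ch in prime:
        preds, state = toy_step(vocab_to_int[ch], state)
    # Generation: sample a character, then feed it back in.
    for _ in range(n_samples):
        c = int(rng.choice(len(vocab), p=preds))
        samples.append(int_to_vocab[c])
        preds, state = toy_step(c, state)
    return "".join(samples), state

text, steps = toy_sample(10, prime="ab")
print(repr(text), steps)
```

The returned string always begins with the prime text, and the state is updated once per character fed in, whether that character came from the prime or from sampling.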

Here, pass in the path to a checkpoint and sample from the network.

In [23]:
tf.train.latest_checkpoint('checkpoints')

'checkpoints/i4545_l128.ckpt'

In [24]:
checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime="Far")
print(samp)

Faring in otherly. I advone to the the to become is and that's both. 

Work the sometimes our compentate it's a making, but I also subreddit is a mearer is chatalities."
"Thanking around that is so stand and would story the point and they seem instant talk incorses are than the percester is a poot and showed."
"It all the meaning as the most to a corlers in a roliconin a play a second or there and we are seans a contral way, as a crice as there to a busing is an and is stayel a price watch."
"Well was sumpout the peaser as with the tall only seem to said and in this on myself to aller all it the plantically antiman in the car a past and the sort is netter, and there was problems on the pretty and is stopped a steal and a most is a company.

Was tern outens to have to can soled of around to that solutions, this. 
"
"It were a count that also then and to said the too too took. I criced to consint that that are seen to the time and are still say it sublest someone, the partass of make out

In [26]:
checkpoint = 'checkpoints/i4545_l128.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Fary is a posted. The way is so serial are it and the crict things.

I are when I'm sort there or a class to comparers and assect that the poor.  I'm seen the parks, and season is to stop with the topically that the comment times of a since, then the plates the makes is so the tonal tarleries any poon they aren't couldn't have an everyone if it are an answer at any thing.

In a consing of the provine as a pollity in about the sort, but you had to completed a bot indeem and is assect and what you still be so should care to crost to chock a long of information titles and anyone in any something are are.   I as we start works out it the same can arancist with a pressing to be as wanting about any contring and we are a camplar at mise intermant issue and suck that and wat is already saying to the at of the consult that this was a person through what you'd be playing they was talking of the marking of an issue would be trurtange in a sen on that.

In me was police to and who could support, 

In [28]:
checkpoint = 'checkpoints/i4545_l128.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Fary.

And the took, I have been strength in to make on a match work in a point a barden other and there wasn't the comment was.  I don't. Also the seant in the percension was thinked to all it, and that this warranning to a point and in the same compect and sterears isn't subreddit women such on those there to an to the sound though if you completely some are an one. That's no solating a castary working to start. It subreddit. The waining on the same to try in the time which is around this was pateded are assume of the sound of it in a cars on a clessions. 

I have no tanters of a chertational."" 
"
"It's tash is the prices in an insure of the second as to to compense into anything in the say a processive is asked. I can't could a cricer inclused araline their sering, because their through wouldn't charable out arans of this tand there of the sarical. 

That's all some there to think to the chense on all that this to straight that the prampting it.  

Also if you crangion ats a poor w

In [29]:
checkpoint = 'checkpoints/i4545_l128.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Fars taken and shit the thing all it's all that were to check the track at a things. It taken it as anything and we ack if the trardening as to some place the save and to be an this thing that there are perfect over other of the clas and all that work of as the prisaration want a level. This story it that's strikge and their anternalitin on other streallys it isn stop the marn indesical or is it to be ward of in modest tarting on my to trast and a prove to be allowing, but I'm thread to be someone someto a part and a care tha laine armer of the completely sayed.  If it's betually a sideline of aranching on the past that watch any still sure a been that is that's people. I'm sure it's a completely in the thing well on it are around the side. The side tark to the countinas in other time. If it time to and they can an incompletely a people in the something of it. I wanted to be a classing it and a controt times that someone.
"I have not it. I would be to the conclidener on is troat this i

In [30]:
checkpoint = 'checkpoints/i4545_l128.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Farered to the saying are storage. 

I and a latter oun and an encompot and is the this to be that the count astall of the strong and, it, are sound to could be that your server is not the peepleding or the past of manation was around a like to seem to complate in my one that should bad in a so the ting that in the possessional assuming of a more takes and the then one than a poor to a lest and are steating and anynger as the creates. 

Woller triss to be to a let or ser the too sounds and sound a can that the compates also were arain tool with oth are then what it are and trying indonding on my surreed and this seen in the plails, they actually term as the check of the trives."
"You don't think you want to sumparity to the same also, we have been something are contuct a starten to any assing the per though out a compromed to tratival into summer that try to any there in the sereasion. And aren't an accesse is not inferers about the sentinator. It and that if you are and to treatian al