Revised and fixed code from https://github.com/JayParks/transformer (machine translation, MT), adapted for language modelling (LM)

The code is stripped down from the JayParks repo, which implements a Transformer for MT. A few changes were introduced for speed, but training is still frustratingly slow; speeding it up further is a possible improvement.

Another issue is hyperparameter search for language modelling (number of heads, number of self-attention layers, etc.): the model does not work well out of the box. This might help: https://arxiv.org/pdf/1804.00247.pdf.
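
One hedged way to organise that search is a plain grid over the knobs mentioned above. The sketch below only enumerates configurations and picks the best one; `train_and_eval` is a hypothetical stand-in for building an `LMTransformer`, training it, and returning a held-out perplexity, and the candidate values are illustrative assumptions rather than recommendations.

```python
import itertools

# Candidate values are illustrative assumptions, not tuned recommendations.
search_space = {
    "n_layers": [2, 4, 6],
    "n_heads": [4, 8],
    "dropout": [0.1, 0.3],
}

def iter_configs(space):
    """Yield one dict per point of the Cartesian grid over `space`."""
    names = sorted(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(space, train_and_eval):
    """Return (best_config, best_score); a lower score (e.g. perplexity) is better."""
    best_config, best_score = None, float("inf")
    for config in iter_configs(space):
        score = train_and_eval(**config)  # hypothetical user-supplied function
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

configs = list(iter_configs(search_space))
print(len(configs))  # 3 * 2 * 2 = 12 configurations
```

At roughly 27 seconds per epoch a full grid is affordable only for a handful of epochs per configuration, so random sampling from `configs` may be a more practical starting point.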

Also consider parallelizing.

# TODO

* Clean up
* Add MoS
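
Assuming MoS here stands for Mixture of Softmaxes (Yang et al., "Breaking the Softmax Bottleneck"), the output layer would replace the single softmax with a weighted mixture of $K$ softmaxes over the vocabulary. A sketch of the formula, where $g_c$ is the final hidden state for context $c$, $w_x$ is the output embedding of token $x$, and $W_{h,k}$, $w_{\pi,k}$ are extra learned parameters (the notation is an assumption, not taken from this code):

$$
P(x \mid c) = \sum_{k=1}^{K} \pi_{c,k}\,\frac{\exp(h_{c,k}^\top w_x)}{\sum_{x'} \exp(h_{c,k}^\top w_{x'})},
\qquad
\pi_{c,k} = \frac{\exp(w_{\pi,k}^\top g_c)}{\sum_{k'} \exp(w_{\pi,k'}^\top g_c)},
\qquad
h_{c,k} = \tanh(W_{h,k}\, g_c)
$$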

# Sentence-wise batching

This version of the Transformer LM uses sentence-wise batching (each sentence is an independent example).

**NB** Before running, make sure the source code accounts for PAD (the padding token).
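
As a rough illustration of what accounting for PAD involves, the sketch below pads variable-length sentences to a common length and averages a per-token loss over real tokens only, which is the effect `ignore_index` has in the loss. `PAD = 0` and the helper names are assumptions for this example; the value actually used by the code lives in `const.PAD`.

```python
PAD = 0  # assumed padding index for this example; the notebook uses const.PAD

def pad_batch(sentences, pad=PAD):
    """Right-pad token-id lists to the longest sentence; return (batch, lengths)."""
    lengths = [len(s) for s in sentences]
    max_len = max(lengths)
    batch = [s + [pad] * (max_len - len(s)) for s in sentences]
    return batch, lengths

def masked_mean_loss(token_losses, targets, pad=PAD):
    """Average per-token losses over positions whose target is not PAD."""
    kept = [loss
            for row_losses, row_targets in zip(token_losses, targets)
            for loss, tgt in zip(row_losses, row_targets)
            if tgt != pad]
    return sum(kept) / len(kept)

batch, lengths = pad_batch([[5, 8, 2], [7, 3], [9]])
print(batch)    # [[5, 8, 2], [7, 3, 0], [9, 0, 0]]
print(lengths)  # [3, 2, 1]

# Made-up per-token losses; the padded batch doubles as the targets here.
losses = [[1.0, 2.0, 3.0], [4.0, 5.0, 9.9], [6.0, 9.9, 9.9]]
print(masked_mean_loss(losses, batch))  # 3.5, padded positions are ignored
```

Without the mask, the three padded positions would pull the average up to about 5.63, so reported perplexity would depend on how much padding each batch happens to contain.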

In [4]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="7"

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from tqdm import tqdm
from showprogress import *

import torch
# torch.cuda.device(0)
import torch.nn as nn
import torch.optim as optim

from torch.nn.utils.rnn import pack_padded_sequence as pack
from torch.nn.utils.rnn import pad_packed_sequence as pad
from torch.nn.utils import clip_grad_norm_ as clip
from torch.optim.lr_scheduler import StepLR

import const
from data import *
from transformer import *

In [2]:
ptb_datapath_train = 'data/penn/train.txt'
ptb_datapath_valid = 'data/penn/valid.txt'
ptb_datapath_test = 'data/penn/test.txt'

batch_size = 128

ptb_train = DataSet(ptb_datapath_train, batch_size, display_freq=0, max_len=90, trunc_len=90)
ptb_valid = DataSet(ptb_datapath_valid, batch_size, display_freq=0, max_len=90, trunc_len=90)
ptb_test = DataSet(ptb_datapath_test, batch_size, display_freq=0, max_len=90, trunc_len=90)

Loading data from data/penn/train.txt ...
Loading data from data/penn/valid.txt ...
Loading data from data/penn/test.txt ...


In [3]:
ptb_train.build_dict()
ptb_valid.change_dict(ptb_train.dictionary)
ptb_test.change_dict(ptb_train.dictionary)

Building dictionary...
Done.
Save dictionary at data/penn/train.txt.dict
Index tokens ...
42068 sentences were processed, 0 longer than maximum length,0 were ignored because zero length
Data discription:
Data name : data/penn/train.txt
Number of sentence : 42068
Number of tokens : 887521
Vocabulary size : 10000
Number of batches : 328
Batch size : 128
Done.
Index tokens ...
3370 sentences were processed, 0 longer than maximum length,0 were ignored because zero length
Data discription:
Data name : data/penn/valid.txt
Number of sentence : 3370
Number of tokens : 70390
Vocabulary size : 10000
Number of batches : 26
Batch size : 128
Done.
Index tokens ...
3761 sentences were processed, 0 longer than maximum length,0 were ignored because zero length
Data discription:
Data name : data/penn/test.txt
Number of sentence : 3761
Number of tokens : 78669
Vocabulary size : 10000
Number of batches : 29
Batch size : 128
Done.
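
The batch counts in the log are consistent with dropping the final partial batch, i.e. floor division of the sentence count by the batch size. This is inferred from the printed numbers, not from the `DataSet` source, so treat it as an assumption:

```python
batch_size = 128

# Sentence counts copied from the log above.
num_sentences = {"train": 42068, "valid": 3370, "test": 3761}

for split, n in num_sentences.items():
    full_batches = n // batch_size  # complete batches only
    leftover = n % batch_size       # sentences that would not fill a batch
    print(split, full_batches, leftover)
# train 328 84
# valid 26 42
# test 29 49
```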


In [6]:
voc_size = ptb_train.num_vocb
emb_dim = 512
d_k = 64
d_v = 64
n_layers = 2
n_heads = 4
d_ff = 2048
max_tgt_seq_len = 90
dropout = 0.1
weighted_model = False
share_proj_weight = True
lr = 1e-6
n_epochs = 100
clip_grad = 5
warmup_steps = 2000
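
The `lr` value above is only the initial rate: the training loop further down overwrites it after every step with the warmup-then-inverse-square-root schedule from the original Transformer paper. A standalone sketch of that formula, using the `emb_dim` and `warmup_steps` values from this cell (the function name `noam_lr` is ours, not part of the code base):

```python
def noam_lr(step, emb_dim=512, warmup_steps=2000):
    """Linear warmup for `warmup_steps` steps, then decay as step ** -0.5."""
    return emb_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

steps_per_epoch = 328  # number of training batches reported above

for epochs_done in (1, 6, 7):
    step = epochs_done * steps_per_epoch
    print(epochs_done, round(noam_lr(step), 6))
# 1 0.000162
# 6 0.000972
# 7 0.000922
```

The peak is reached at step 2000, early in the seventh epoch, which matches the learning rates printed at the start of epochs 2, 7 and 8 in the training log.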

In [10]:
model = LMTransformer(n_layers, d_k, d_v, emb_dim, d_ff,
                      n_heads, max_tgt_seq_len, voc_size,
                      dropout, weighted_model, share_proj_weight)
criterion = nn.CrossEntropyLoss(ignore_index=const.PAD)

if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()

#opt = optim.Adam(model.trainable_params(), lr=lr)
# lr_lambda = lambda epoch: 0.99 ** epoch
#lrsched = StepLR(opt, step_size=10, gamma=0.5)

Sharing target embedding and projection..


In [11]:
torch.cuda.is_available()

True

In [12]:
import math
import time

opt = optim.Adam(model.trainable_params(), betas=(0.9, 0.98), eps=1e-09, lr=lr)
step = 0  # number of optimizer updates so far; drives the warmup schedule
for epoch in irange(n_epochs):
    acc_loss = 0
    print('Start epoch %d, learning rate %f ' % (epoch + 1, opt.param_groups[0]['lr']))
    start_time = time.time()
    model.train()
    ptb_train.shuffle()
    for batch_idx in irange(ptb_train.num_batch):
        data, lengths, target = ptb_train.get_batch(batch_idx)
        opt.zero_grad()
        output, self_attn = model(data, lengths)

        loss = criterion(output, target.view(-1))
        loss.backward()
        opt.step()
        acc_loss += loss.item()
        step += 1
        # Warmup schedule: lr = emb_dim^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
        # It is set after the update, so it takes effect from the next step.
        new_lr = np.power(emb_dim, -0.5) * np.min([
            np.power(step, -0.5),
            np.power(warmup_steps, -1.5) * step])
        for param_group in opt.param_groups:
            param_group['lr'] = new_lr

    avg_loss = acc_loss / ptb_train.num_batch
    print('Epoch : %d, Batch : %d / %d, Loss : %f, Perplexity : %f, Time : %f'
          % (epoch + 1, batch_idx, ptb_train.num_batch,
             avg_loss, math.exp(avg_loss),
             time.time() - start_time))

    # Evaluation pass. NOTE: this runs on ptb_test even though the printout
    # says "Validation"; use ptb_valid here when selecting models or tuning.
    acc_loss = 0
    model.eval()
    with torch.no_grad():  # no gradients are needed for evaluation
        for batch_idx in irange(ptb_test.num_batch):
            data, lengths, target = ptb_test[batch_idx]
            output, self_attn = model(data, lengths)
            loss = criterion(output, target.view(-1))
            acc_loss += loss.item()

    val_loss = acc_loss / ptb_test.num_batch
    print('Validation Loss : %f' % val_loss)
    print('Validation Perplexity : %f' % math.exp(val_loss))

Start epoch 1, learning rate 0.000001 
Epoch : 1, Batch : 327 / 328, Loss : 7.094643, Perplexity : 1205.491824, Time : 26.984499


Validation Loss : 5.923161
Validation Perplexity : 373.590941
Start epoch 2, learning rate 0.000162 
Epoch : 2, Batch : 327 / 328, Loss : 5.652512, Perplexity : 285.006433, Time : 26.863564


Validation Loss : 5.362131
Validation Perplexity : 213.178737
Start epoch 3, learning rate 0.000324 
Epoch : 3, Batch : 327 / 328, Loss : 5.190272, Perplexity : 179.517309, Time : 26.909928


Validation Loss : 5.091826
Validation Perplexity : 162.686632
Start epoch 4, learning rate 0.000486 
Epoch : 4, Batch : 327 / 328, Loss : 4.890518, Perplexity : 133.022409, Time : 26.720397


Validation Loss : 4.964463
Validation Perplexity : 143.231570
Start epoch 5, learning rate 0.000648 
Epoch : 5, Batch : 327 / 328, Loss : 4.664119, Perplexity : 106.072129, Time : 26.764846


Validation Loss : 4.891808
Validation Perplexity : 133.194198
Start epoch 6, learning rate 0.000810 
Epoch : 6, Batch : 327 / 328, Loss : 4.479422, Perplexity : 88.183708, Time : 26.839649


Validation Loss : 4.874202
Validation Perplexity : 130.869646
Start epoch 7, learning rate 0.000972 
Epoch : 7, Batch : 327 / 328, Loss : 4.306205, Perplexity : 74.158537, Time : 26.968282


Validation Loss : 4.849750
Validation Perplexity : 127.708483
Start epoch 8, learning rate 0.000922 
Epoch : 8, Batch : 327 / 328, Loss : 4.104188, Perplexity : 60.593548, Time : 27.066830


Validation Loss : 4.854182
Validation Perplexity : 128.275685
Start epoch 9, learning rate 0.000863 
Epoch : 9, Batch : 327 / 328, Loss : 3.924664, Perplexity : 50.636052, Time : 27.156511


Validation Loss : 4.892143
Validation Perplexity : 133.238832
Start epoch 10, learning rate 0.000813 
Epoch : 10, Batch : 327 / 328, Loss : 3.763871, Perplexity : 43.114997, Time : 27.238722


Validation Loss : 4.916912
Validation Perplexity : 136.580269
Start epoch 11, learning rate 0.000772 
Epoch : 11, Batch : 327 / 328, Loss : 3.619289, Perplexity : 37.311012, Time : 27.229854


Validation Loss : 4.954940
Validation Perplexity : 141.874046
Start epoch 12, learning rate 0.000736 
Epoch : 12, Batch : 327 / 328, Loss : 3.487296, Perplexity : 32.697401, Time : 27.129731


Validation Loss : 5.014236
Validation Perplexity : 150.541132
Start epoch 13, learning rate 0.000704 
Epoch : 13, Batch : 327 / 328, Loss : 3.365137, Perplexity : 28.937452, Time : 27.084126


Validation Loss : 5.091677
Validation Perplexity : 162.662390
Start epoch 14, learning rate 0.000677 
Epoch : 14, Batch : 327 / 328, Loss : 3.252177, Perplexity : 25.846550, Time : 27.125109


Validation Loss : 5.143655
Validation Perplexity : 171.340891
Start epoch 15, learning rate 0.000652 
Epoch : 15, Batch : 327 / 328, Loss : 3.151060, Perplexity : 23.360821, Time : 27.142206


Validation Loss : 5.197637
Validation Perplexity : 180.844481
Start epoch 16, learning rate 0.000630 
Epoch : 16, Batch : 327 / 328, Loss : 3.056220, Perplexity : 21.247094, Time : 27.132669


Validation Loss : 5.248883
Validation Perplexity : 190.353544
Start epoch 17, learning rate 0.000610 
Epoch : 17, Batch : 327 / 328, Loss : 2.970233, Perplexity : 19.496454, Time : 27.081261


Validation Loss : 5.308233
Validation Perplexity : 201.992997
Start epoch 18, learning rate 0.000592 
Epoch : 18, Batch : 327 / 328, Loss : 2.891288, Perplexity : 18.016504, Time : 27.247175


Validation Loss : 5.378051
Validation Perplexity : 216.599740
Start epoch 19, learning rate 0.000575 
Epoch : 19, Batch : 327 / 328, Loss : 2.819584, Perplexity : 16.769868, Time : 27.090880


Validation Loss : 5.448691
Validation Perplexity : 232.453686
Start epoch 20, learning rate 0.000560 
Epoch : 20, Batch : 327 / 328, Loss : 2.752951, Perplexity : 15.688862, Time : 27.009162


Validation Loss : 5.488341
Validation Perplexity : 241.855636
Start epoch 21, learning rate 0.000546 
Epoch : 21, Batch : 327 / 328, Loss : 2.688037, Perplexity : 14.702790, Time : 27.252764


Validation Loss : 5.538740
Validation Perplexity : 254.357339
Start epoch 22, learning rate 0.000532 
Epoch : 22, Batch : 327 / 328, Loss : 2.631624, Perplexity : 13.896314, Time : 27.195158


Validation Loss : 5.590573
Validation Perplexity : 267.889181
Start epoch 23, learning rate 0.000520 
Epoch : 23, Batch : 327 / 328, Loss : 2.577792, Perplexity : 13.168024, Time : 27.125402


Validation Loss : 5.642174
Validation Perplexity : 282.075260
Start epoch 24, learning rate 0.000509 
Epoch : 24, Batch : 327 / 328, Loss : 2.529212, Perplexity : 12.543618, Time : 27.209643


Validation Loss : 5.681014
Validation Perplexity : 293.246591
Start epoch 25, learning rate 0.000498 
Epoch : 25, Batch : 327 / 328, Loss : 2.480861, Perplexity : 11.951548, Time : 27.151124


Validation Loss : 5.725553
Validation Perplexity : 306.602645
Start epoch 26, learning rate 0.000488 
Epoch : 26, Batch : 327 / 328, Loss : 2.439221, Perplexity : 11.464106, Time : 27.270407


Validation Loss : 5.772868
Validation Perplexity : 321.458384
Start epoch 27, learning rate 0.000479 
Epoch : 27, Batch : 327 / 328, Loss : 2.399350, Perplexity : 11.016019, Time : 27.143059


Validation Loss : 5.826759
Validation Perplexity : 339.257382
Start epoch 28, learning rate 0.000470 
Epoch : 28, Batch : 327 / 328, Loss : 2.359158, Perplexity : 10.582040, Time : 26.970661


Validation Loss : 5.863462
Validation Perplexity : 351.940596
Start epoch 29, learning rate 0.000461 
Epoch : 29, Batch : 327 / 328, Loss : 2.326607, Perplexity : 10.243132, Time : 27.187175


Validation Loss : 5.901393
Validation Perplexity : 365.546374
Start epoch 30, learning rate 0.000453 
Epoch : 30, Batch : 327 / 328, Loss : 2.292824, Perplexity : 9.902864, Time : 27.371835


Validation Loss : 5.932038
Validation Perplexity : 376.921989
Start epoch 31, learning rate 0.000446 
Epoch : 31, Batch : 327 / 328, Loss : 2.258217, Perplexity : 9.566013, Time : 26.966738


Validation Loss : 5.977356
Validation Perplexity : 394.396318
Start epoch 32, learning rate 0.000438 
Epoch : 32, Batch : 327 / 328, Loss : 2.232004, Perplexity : 9.318523, Time : 27.159499


Validation Loss : 6.001063
Validation Perplexity : 403.858020
Start epoch 33, learning rate 0.000431 
Epoch : 33, Batch : 327 / 328, Loss : 2.200691, Perplexity : 9.031249, Time : 27.080312


Validation Loss : 6.053239
Validation Perplexity : 425.489050
Start epoch 34, learning rate 0.000425 
Epoch : 34, Batch : 327 / 328, Loss : 2.175237, Perplexity : 8.804269, Time : 27.198309


Validation Loss : 6.078005
Validation Perplexity : 436.158263
Start epoch 35, learning rate 0.000418 
Epoch : 35, Batch : 327 / 328, Loss : 2.148704, Perplexity : 8.573738, Time : 27.226201


Validation Loss : 6.113234
Validation Perplexity : 451.797465
Start epoch 36, learning rate 0.000412 
Epoch : 36, Batch : 327 / 328, Loss : 2.125612, Perplexity : 8.378024, Time : 27.156723


Validation Loss : 6.138633
Validation Perplexity : 463.419538
Start epoch 37, learning rate 0.000407 
Epoch : 37, Batch : 327 / 328, Loss : 2.102217, Perplexity : 8.184292, Time : 27.262805


Validation Loss : 6.168295
Validation Perplexity : 477.371676
Start epoch 38, learning rate 0.000401 
Epoch : 38, Batch : 327 / 328, Loss : 2.078866, Perplexity : 7.995399, Time : 27.146727


Validation Loss : 6.213635
Validation Perplexity : 499.513737
Start epoch 39, learning rate 0.000396 
Epoch : 39, Batch : 327 / 328, Loss : 2.055921, Perplexity : 7.814034, Time : 27.178237


Validation Loss : 6.224810
Validation Perplexity : 505.127047
Start epoch 40, learning rate 0.000391 
Epoch : 40, Batch : 327 / 328, Loss : 2.037631, Perplexity : 7.672410, Time : 26.766756


Validation Loss : 6.255915
Validation Perplexity : 521.085959
Start epoch 41, learning rate 0.000386 
Epoch : 41, Batch : 327 / 328, Loss : 2.017813, Perplexity : 7.521858, Time : 26.833324


Validation Loss : 6.276245
Validation Perplexity : 531.787986
Start epoch 42, learning rate 0.000381 
Epoch : 42, Batch : 327 / 328, Loss : 1.998812, Perplexity : 7.380281, Time : 26.834451


Validation Loss : 6.307731
Validation Perplexity : 548.798196
Start epoch 43, learning rate 0.000377 
Epoch : 43, Batch : 327 / 328, Loss : 1.981641, Perplexity : 7.254636, Time : 26.934371


Validation Loss : 6.329274
Validation Perplexity : 560.749312
Start epoch 44, learning rate 0.000372 
Epoch : 44, Batch : 327 / 328, Loss : 1.966072, Perplexity : 7.142566, Time : 26.866119


Validation Loss : 6.347596
Validation Perplexity : 571.118242
Start epoch 45, learning rate 0.000368 
Epoch : 45, Batch : 327 / 328, Loss : 1.947448, Perplexity : 7.010775, Time : 27.118426


Validation Loss : 6.383005
Validation Perplexity : 591.703093
Start epoch 46, learning rate 0.000364 
Epoch : 46, Batch : 327 / 328, Loss : 1.928733, Perplexity : 6.880787, Time : 27.092984


Validation Loss : 6.407354
Validation Perplexity : 606.287576
Start epoch 47, learning rate 0.000360 
Epoch : 47, Batch : 327 / 328, Loss : 1.914071, Perplexity : 6.780635, Time : 27.167691


Validation Loss : 6.450402
Validation Perplexity : 632.956896
Start epoch 48, learning rate 0.000356 
Epoch : 48, Batch : 327 / 328, Loss : 1.899156, Perplexity : 6.680251, Time : 27.283344


Validation Loss : 6.454909
Validation Perplexity : 635.815757
Start epoch 49, learning rate 0.000352 
Epoch : 49, Batch : 327 / 328, Loss : 1.885317, Perplexity : 6.588446, Time : 27.084816


Validation Loss : 6.466480
Validation Perplexity : 643.215931
Start epoch 50, learning rate 0.000349 
Epoch : 50, Batch : 327 / 328, Loss : 1.868610, Perplexity : 6.479283, Time : 27.006098


Validation Loss : 6.493697
Validation Perplexity : 660.962431
Start epoch 51, learning rate 0.000345 
Epoch : 51, Batch : 327 / 328, Loss : 1.857038, Perplexity : 6.404735, Time : 27.059994


Validation Loss : 6.512224
Validation Perplexity : 673.321990
Start epoch 52, learning rate 0.000342 
Epoch : 52, Batch : 327 / 328, Loss : 1.843495, Perplexity : 6.318581, Time : 27.155902


Validation Loss : 6.543720
Validation Perplexity : 694.866656
Start epoch 53, learning rate 0.000338 
Epoch : 53, Batch : 327 / 328, Loss : 1.831606, Perplexity : 6.243909, Time : 26.960401


Validation Loss : 6.567617
Validation Perplexity : 711.671887
Start epoch 54, learning rate 0.000335 
Epoch : 54, Batch : 327 / 328, Loss : 1.818280, Perplexity : 6.161253, Time : 27.133557


Validation Loss : 6.573815
Validation Perplexity : 716.096630
Start epoch 55, learning rate 0.000332 
Epoch : 55, Batch : 327 / 328, Loss : 1.805920, Perplexity : 6.085566, Time : 27.120699


Validation Loss : 6.602375
Validation Perplexity : 736.842811
Start epoch 56, learning rate 0.000329 
Epoch : 56, Batch : 327 / 328, Loss : 1.792561, Perplexity : 6.004809, Time : 27.061006


Validation Loss : 6.628276
Validation Perplexity : 756.177026
Start epoch 57, learning rate 0.000326 
Epoch : 57, Batch : 327 / 328, Loss : 1.784615, Perplexity : 5.957286, Time : 26.814238


Validation Loss : 6.636161
Validation Perplexity : 762.163385
Start epoch 58, learning rate 0.000323 
Epoch : 58, Batch : 327 / 328, Loss : 1.774225, Perplexity : 5.895712, Time : 26.987365


Validation Loss : 6.658803
Validation Perplexity : 779.617486
Start epoch 59, learning rate 0.000320 
Epoch : 59, Batch : 327 / 328, Loss : 1.761852, Perplexity : 5.823209, Time : 27.043524


Validation Loss : 6.673084
Validation Perplexity : 790.830597
Start epoch 60, learning rate 0.000318 
Epoch : 60, Batch : 327 / 328, Loss : 1.751658, Perplexity : 5.764153, Time : 27.086851


Validation Loss : 6.697337
Validation Perplexity : 810.245415
Start epoch 61, learning rate 0.000315 
Epoch : 61, Batch : 327 / 328, Loss : 1.739810, Perplexity : 5.696259, Time : 27.257638


Validation Loss : 6.714050
Validation Perplexity : 823.901093
Start epoch 62, learning rate 0.000312 
Epoch : 62, Batch : 327 / 328, Loss : 1.731422, Perplexity : 5.648682, Time : 27.092442


Validation Loss : 6.722033
Validation Perplexity : 830.503972
Start epoch 63, learning rate 0.000310 
Epoch : 63, Batch : 327 / 328, Loss : 1.721388, Perplexity : 5.592286, Time : 26.928650


Validation Loss : 6.730752
Validation Perplexity : 837.777140
Start epoch 64, learning rate 0.000307 
Epoch : 64, Batch : 327 / 328, Loss : 1.711774, Perplexity : 5.538778, Time : 26.810250


Validation Loss : 6.754939
Validation Perplexity : 858.287701
Start epoch 65, learning rate 0.000305 
Epoch : 65, Batch : 327 / 328, Loss : 1.701918, Perplexity : 5.484458, Time : 27.013523


Validation Loss : 6.762844
Validation Perplexity : 865.098704
Start epoch 66, learning rate 0.000303 
Epoch : 66, Batch : 327 / 328, Loss : 1.692734, Perplexity : 5.434317, Time : 27.018415


Validation Loss : 6.781523
Validation Perplexity : 881.410407
Start epoch 67, learning rate 0.000300 
Epoch : 67, Batch : 327 / 328, Loss : 1.685287, Perplexity : 5.394000, Time : 26.951969


Validation Loss : 6.804494
Validation Perplexity : 901.891344
Start epoch 68, learning rate 0.000298 
Epoch : 68, Batch : 327 / 328, Loss : 1.676096, Perplexity : 5.344649, Time : 26.752085


Validation Loss : 6.819672
Validation Perplexity : 915.684760
Start epoch 69, learning rate 0.000296 
Epoch : 69, Batch : 327 / 328, Loss : 1.666975, Perplexity : 5.296120, Time : 27.002089


Validation Loss : 6.825963
Validation Perplexity : 921.463047
Start epoch 70, learning rate 0.000294 
Epoch : 70, Batch : 327 / 328, Loss : 1.659649, Perplexity : 5.257465, Time : 26.951808


Validation Loss : 6.856897
Validation Perplexity : 950.413738
Start epoch 71, learning rate 0.000292 
Epoch : 71, Batch : 327 / 328, Loss : 1.653279, Perplexity : 5.224081, Time : 26.854233


Validation Loss : 6.859000
Validation Perplexity : 952.413715
Start epoch 72, learning rate 0.000290 
Epoch : 72, Batch : 327 / 328, Loss : 1.645579, Perplexity : 5.184009, Time : 26.881393


Validation Loss : 6.861928
Validation Perplexity : 955.207139
Start epoch 73, learning rate 0.000288 
Epoch : 73, Batch : 327 / 328, Loss : 1.636548, Perplexity : 5.137406, Time : 26.891830


Validation Loss : 6.877216
Validation Perplexity : 969.922073
Start epoch 74, learning rate 0.000286 
Epoch : 74, Batch : 327 / 328, Loss : 1.629119, Perplexity : 5.099382, Time : 26.933552


Validation Loss : 6.883338
Validation Perplexity : 975.878547
Start epoch 75, learning rate 0.000284 
Epoch : 75, Batch : 327 / 328, Loss : 1.623339, Perplexity : 5.069989, Time : 26.855798


Validation Loss : 6.918983
Validation Perplexity : 1011.290888
Start epoch 76, learning rate 0.000282 
Epoch : 76, Batch : 327 / 328, Loss : 1.612891, Perplexity : 5.017295, Time : 26.797453


Validation Loss : 6.917616
Validation Perplexity : 1009.909886
Start epoch 77, learning rate 0.000280 
Epoch : 77, Batch : 327 / 328, Loss : 1.606956, Perplexity : 4.987605, Time : 27.049896


Validation Loss : 6.933735
Validation Perplexity : 1026.319689
Start epoch 78, learning rate 0.000278 
Epoch : 78, Batch : 327 / 328, Loss : 1.600161, Perplexity : 4.953831, Time : 27.194089


Validation Loss : 6.941503
Validation Perplexity : 1034.323226
Start epoch 79, learning rate 0.000276 
Epoch : 79, Batch : 327 / 328, Loss : 1.594091, Perplexity : 4.923849, Time : 27.032208


Validation Loss : 6.950073
Validation Perplexity : 1043.225688
Start epoch 80, learning rate 0.000275 
Epoch : 80, Batch : 327 / 328, Loss : 1.587207, Perplexity : 4.890074, Time : 27.222322


Validation Loss : 6.985729
Validation Perplexity : 1081.094448
Start epoch 81, learning rate 0.000273 
Epoch : 81, Batch : 327 / 328, Loss : 1.580025, Perplexity : 4.855079, Time : 26.994811


Validation Loss : 6.995927
Validation Perplexity : 1092.175140
Start epoch 82, learning rate 0.000271 
Epoch : 82, Batch : 327 / 328, Loss : 1.574149, Perplexity : 4.826633, Time : 26.836708


Validation Loss : 6.987481
Validation Perplexity : 1082.989828
Start epoch 83, learning rate 0.000269 
Epoch : 83, Batch : 327 / 328, Loss : 1.568231, Perplexity : 4.798152, Time : 26.900294


Validation Loss : 7.003741
Validation Perplexity : 1100.742968
Start epoch 84, learning rate 0.000268 
Epoch : 84, Batch : 327 / 328, Loss : 1.560719, Perplexity : 4.762243, Time : 27.157232


Validation Loss : 7.014980
Validation Perplexity : 1113.184753
Start epoch 85, learning rate 0.000266 
Epoch : 85, Batch : 327 / 328, Loss : 1.558863, Perplexity : 4.753415, Time : 26.750934


Validation Loss : 7.017495
Validation Perplexity : 1115.987940
Start epoch 86, learning rate 0.000265 
Epoch : 86, Batch : 327 / 328, Loss : 1.549592, Perplexity : 4.709548, Time : 26.836635


Validation Loss : 7.024148
Validation Perplexity : 1123.437486
Start epoch 87, learning rate 0.000263 
Epoch : 87, Batch : 327 / 328, Loss : 1.542616, Perplexity : 4.676808, Time : 26.812194


Validation Loss : 7.052453
Validation Perplexity : 1155.690414
Start epoch 88, learning rate 0.000262 
Epoch : 88, Batch : 327 / 328, Loss : 1.538744, Perplexity : 4.658735, Time : 26.938578


Validation Loss : 7.064746
Validation Perplexity : 1169.985240
Start epoch 89, learning rate 0.000260 
Epoch : 89, Batch : 327 / 328, Loss : 1.533373, Perplexity : 4.633782, Time : 26.760776


Validation Loss : 7.066088
Validation Perplexity : 1171.555587
Start epoch 90, learning rate 0.000259 
Epoch : 90, Batch : 327 / 328, Loss : 1.528406, Perplexity : 4.610820, Time : 26.750903


Validation Loss : 7.081171
Validation Perplexity : 1189.359900
Start epoch 91, learning rate 0.000257 
Epoch : 91, Batch : 327 / 328, Loss : 1.522421, Perplexity : 4.583309, Time : 26.761417


Validation Loss : 7.082516
Validation Perplexity : 1190.960795
Start epoch 92, learning rate 0.000256 
Epoch : 92, Batch : 327 / 328, Loss : 1.516350, Perplexity : 4.555565, Time : 26.780004


Validation Loss : 7.108546
Validation Perplexity : 1222.369344
Start epoch 93, learning rate 0.000254 
Epoch : 93, Batch : 327 / 328, Loss : 1.514387, Perplexity : 4.546635, Time : 26.818626


Validation Loss : 7.129600
Validation Perplexity : 1248.377720
Start epoch 94, learning rate 0.000253 
Epoch : 94, Batch : 327 / 328, Loss : 1.504688, Perplexity : 4.502748, Time : 27.108105


Validation Loss : 7.121537
Validation Perplexity : 1238.352414
Start epoch 95, learning rate 0.000252 
Epoch : 95, Batch : 327 / 328, Loss : 1.501635, Perplexity : 4.489021, Time : 26.991120


Validation Loss : 7.129529
Validation Perplexity : 1248.289499
Start epoch 96, learning rate 0.000250 
Epoch : 96, Batch : 327 / 328, Loss : 1.494709, Perplexity : 4.458040, Time : 26.802150


Validation Loss : 7.145844
Validation Perplexity : 1268.821278
Start epoch 97, learning rate 0.000249 
Epoch : 97, Batch : 327 / 328, Loss : 1.492902, Perplexity : 4.449989, Time : 26.865038


Validation Loss : 7.147723
Validation Perplexity : 1271.208305
Start epoch 98, learning rate 0.000248 

KeyboardInterrupt: 
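
The log shows the held-out loss (printed as "Validation Loss") bottoming out early and then rising steadily while the training loss keeps falling, so most of these epochs only overfit. A minimal early-stopping sketch, using the first ten held-out losses copied from the log above; the `patience` value and the function name are assumptions for illustration:

```python
import math

# Held-out losses for epochs 1-10, copied from the log above.
eval_losses = [5.923161, 5.362131, 5.091826, 4.964463, 4.891808,
               4.874202, 4.849750, 4.854182, 4.892143, 4.916912]

def early_stop_epoch(losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once `patience` consecutive epochs fail to improve on
    the best loss seen so far; stop_epoch is None if that never happens.
    """
    best_loss, best_epoch, bad_epochs = float("inf"), None, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, None

best_epoch, stop_epoch = early_stop_epoch(eval_losses)
print(best_epoch, stop_epoch)                           # 7 10
print(round(math.exp(eval_losses[best_epoch - 1]), 2))  # 127.71
```

In the notebook this would amount to saving a checkpoint whenever the held-out loss improves and breaking out of the epoch loop once the patience runs out, which here would have ended the run after about ten epochs instead of 97.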