<a href="https://colab.research.google.com/github/mitkrieg/dl-assignment-2/blob/main/assignment2_practical.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 5787 Deep Learning Assignment 2

This notebook implements the "small" LSTM model described in "Recurrent Neural Network Regularization" by Zaremba et al. (2014), along with a GRU variant, and trains each with and without dropout on the Penn Treebank (PTB) data.

## Initial Setup

### Install Weights & Biases

In [2]:
!pip install wandb
!wandb login


### Imports & GPU Check

In [3]:
import torch
from torch import nn
from torch.utils.data import Dataset
from torch import optim
import torch.nn.functional as F
import math
import wandb

torch.manual_seed(123)
torch.cuda.manual_seed(123)
torch.cuda.manual_seed_all(123)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

print("------ ACCELERATION INFO -----")
print('CUDA GPU Available:',torch.cuda.is_available())
print('MPS GPU Available:', torch.backends.mps.is_available())
if torch.cuda.is_available():
  device = torch.device('cuda')
  print('GPU Name:',torch.cuda.get_device_name(0))
  print('GPU Count:',torch.cuda.device_count())
  print('GPU Memory Allocated:',torch.cuda.memory_allocated(0))
  print('GPU Memory Cached:',torch.cuda.memory_reserved(0))
# elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
#   device = torch.device('mps')
#   print('Pytorch GPU Build:',torch.backends.mps.is_built())
else:
  device = torch.device('cpu')
  print('Using CPU')

------ ACCELERATION INFO -----
CUDA GPU Available: True
MPS GPU Available: False
GPU Name: Tesla T4
GPU Count: 1
GPU Memory Allocated: 0
GPU Memory Cached: 0


## Vocabulary & PTBText Dataset Classes

Parse the raw PTB text files into token indices and wrap them in a `Dataset` class that serves batched input and label sequences.

In [5]:
class Vocab:
    def __init__(self, pre_built_dict: dict=None):
        if pre_built_dict:
            self.vocab = pre_built_dict
        else:
            self.vocab = {'<pad>': 0, '<oov>': 1, '<sos>': 2, '<eos>': 3, '<unk>': 4}
        # Reverse mapping (index -> word) so decode() is a direct lookup per token
        self.inverse = {idx: word for word, idx in self.vocab.items()}
        self.idx = len(self.vocab)

    def add_word(self, word: str) -> None:
        if word not in self.vocab:
            self.vocab[word] = self.idx
            self.inverse[self.idx] = word
            self.idx += 1

    def encode(self, tokens: list[str]) -> list[int]:
        # Words missing from the vocabulary map to '<unk>'
        return [self.vocab.get(word, self.vocab['<unk>']) for word in tokens]

    def decode(self, indices: list[int]) -> list[str]:
        return [self.inverse[idx] for idx in indices]

    def __len__(self):
        return len(self.vocab)


class PTBText(Dataset):
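    def __init__(self, path: str, vocab: Vocab=None, build_vocab=True, batch_size=20, sequence_length=20, device=torch.device('cpu')):
        self.path = path
        self.device = device
        # Build a fresh Vocab when none is passed; a `Vocab()` default argument would be shared across instances
        self.vocab = vocab if vocab is not None else Vocab()
        self.data = self.load_data(build_vocab)
        self.batch_size = batch_size
        # Number of tokens in each of the `batch_size` contiguous rows
        self.chunk_size = len(self.data) // batch_size
        self.seq_len = sequence_length
        self.minibatches = self.create_batches()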

    def load_data(self, build_vocab):
        data = []
        with open(self.path, 'r') as f:
            for line in f:
                # Mark the end of every line with an explicit '<eos>' token
                tokens = line.strip().split() + ['<eos>']
                if build_vocab:
                    for token in tokens:
                        self.vocab.add_word(token)

                encoded_line = self.vocab.encode(tokens)
                data.extend(encoded_line)
        return data

    def create_batches(self):
        return [self.data[i*self.chunk_size: (i+1)*self.chunk_size] for i in range(self.batch_size)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, j):
        inputs = torch.stack([
            torch.LongTensor(self.minibatches[i][j * self.seq_len : (j + 1) * self.seq_len])
            for i in range(self.batch_size)], dim=0)
        labels = torch.stack([
            torch.LongTensor(self.minibatches[i][j * self.seq_len + 1 : (j + 1) * self.seq_len + 1])
            for i in range(self.batch_size)], dim=0)

        return inputs.to(self.device), labels.to(self.device)

    def get_tokens(self, idx):
        return self.data[idx]

    def get_decoded_tokens(self, idx):
        return self.vocab.decode(self.data[idx])


train = PTBText('/content/ptb.train.txt', device=device)
val = PTBText('/content/ptb.valid.txt', vocab=train.vocab, build_vocab=False, device=device)
test = PTBText('/content/ptb.test.txt', vocab=train.vocab, build_vocab=False, device=device)

datasets = {
    'train': train,
    'val': val,
    'test': test
}

print("Vocab size:", len(train.vocab))
print("Train data size:", len(train))
print("Val data size:", len(val))
print("Test data size:", len(test))

Vocab size: 10003
Train data size: 929589
Val data size: 73760
Test data size: 82430
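
To make the batching layout in `PTBText` concrete, here is a minimal pure-Python sketch on a toy token stream. The helper name `toy_batches` and the toy sizes are assumptions for illustration and are not part of the notebook. It follows the slicing in `create_batches` and `__getitem__` without tensors, except that it stops one window early whenever the last window would lack its final label.

```python
def toy_batches(data, batch_size, seq_len):
    """Split a flat token list into `batch_size` contiguous rows, then cut
    each row into windows of `seq_len` tokens with labels shifted by one."""
    chunk_size = len(data) // batch_size
    rows = [data[i * chunk_size:(i + 1) * chunk_size] for i in range(batch_size)]
    batches = []
    # (chunk_size - 1) leaves room for the final label of the last window
    for j in range((chunk_size - 1) // seq_len):
        inputs = [row[j * seq_len:(j + 1) * seq_len] for row in rows]
        labels = [row[j * seq_len + 1:(j + 1) * seq_len + 1] for row in rows]
        batches.append((inputs, labels))
    return batches


batches = toy_batches(list(range(20)), batch_size=2, seq_len=3)
for inputs, labels in batches:
    print(inputs, '->', labels)
```

Row `i` of batch `j + 1` continues exactly where row `i` of batch `j` stopped, which is why the training loop can carry the hidden state from one batch to the next.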


## Define Model Architecture

The `ZamrembaRNN` module can be instantiated as either an LSTM or a GRU language model, with or without the dropout regularization described in the paper.

In [6]:
class ZamrembaRNN(nn.Module):
    def __init__(self, rnn_type, vocab_size, batch_size=20, embedding_dim=200, hidden_dim=200, num_layers=2, dropout=0, rnn_dropout=0):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn_type = rnn_type
        self.batch_size = batch_size
        if rnn_type == 'lstm':
            self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers, dropout=rnn_dropout, batch_first=True)
        elif rnn_type == 'gru':
            self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers, dropout=rnn_dropout, batch_first=True)
        else:
            raise ValueError("Invalid RNN type: must be 'lstm' or 'gru'")
        self.fc = nn.Linear(hidden_dim, vocab_size)
        if dropout > 0:
            self.dropout = nn.Dropout(dropout)
        else:
            self.dropout = None

        self.init_weights()

    def forward(self, input, hidden):
        output = self.embedding(input)
        if self.dropout is not None:
            output = self.dropout(output)

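        # An LSTM carries two states (hidden and cell), whereas a GRU has a single hidden state.
        # The GRU state arrives wrapped in an extra leading dimension (see get_new_hidden), so hidden[0] unwraps it.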
        if self.rnn_type == 'lstm':
            output, hidden = self.rnn(output, hidden)
        elif self.rnn_type == 'gru':
            output, hidden = self.rnn(output, hidden[0])

        if self.dropout is not None:
            output = self.dropout(output)

        output = self.fc(output)
        return output, hidden

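    def init_weights(self):
        # Uniform initialization in [-0.1, 0.1], drawn from a dedicated seeded generator.
        # Note: only the first recurrent layer's weight matrices (`*_l0`) are re-initialized here;
        # deeper recurrent layers and all biases keep PyTorch's default initialization.
        gen = torch.Generator().manual_seed(132)
        initrange = 0.1
        nn.init.uniform_(self.embedding.weight, -initrange, initrange, generator=gen)
        nn.init.uniform_(self.rnn.weight_ih_l0, -initrange, initrange, generator=gen)
        nn.init.uniform_(self.rnn.weight_hh_l0, -initrange, initrange, generator=gen)
        nn.init.uniform_(self.fc.weight, -initrange, initrange, generator=gen)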
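
As a sanity check on model size, the parameter count can be worked out by hand. The sketch below assumes PyTorch's standard layout for `nn.LSTM` and `nn.GRU` (per layer: one input-to-hidden and one hidden-to-hidden weight matrix plus two bias vectors, stacked over 4 gates for an LSTM and 3 for a GRU). The helper name `rnn_param_count` is hypothetical, and the default vocabulary size of 10003 is taken from the dataset output earlier in the notebook.

```python
def rnn_param_count(rnn_type, vocab_size=10003, embedding_dim=200,
                    hidden_dim=200, num_layers=2):
    """Count trainable parameters of an embedding -> RNN -> linear model."""
    gates = {'lstm': 4, 'gru': 3}[rnn_type]
    embedding = vocab_size * embedding_dim
    recurrent = 0
    for layer in range(num_layers):
        # Only the first layer sees the embedding; deeper layers see hidden states
        input_dim = embedding_dim if layer == 0 else hidden_dim
        recurrent += gates * (hidden_dim * input_dim      # weight_ih
                              + hidden_dim * hidden_dim   # weight_hh
                              + 2 * hidden_dim)           # bias_ih + bias_hh
    decoder = hidden_dim * vocab_size + vocab_size        # fc weight + bias
    return embedding + recurrent + decoder


print('LSTM parameters:', rnn_param_count('lstm'))  # 4654403
print('GRU parameters:', rnn_param_count('gru'))    # 4493603
```

Under these assumptions, the embedding and output layers account for most of the parameters; the recurrent layers contribute 643,200 (LSTM) or 482,400 (GRU).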

## Define Training & Evaluation Loops

In [8]:
def get_new_hidden(model):
    if model.rnn_type == 'lstm':
        return (torch.zeros(model.num_layers, model.batch_size, model.hidden_dim).to(device),
              torch.zeros(model.num_layers, model.batch_size, model.hidden_dim).to(device))
    elif model.rnn_type == 'gru':
        return torch.zeros(model.num_layers, model.batch_size, model.hidden_dim).to(device).unsqueeze(0)
    else:
        raise ValueError("Invalid RNN type: must be 'lstm' or 'gru'")

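def detach_hidden(hidden):
    """Cut the hidden state off from the previous batch's graph (truncated backpropagation through time)."""
    if isinstance(hidden, tuple):
        return tuple(h.detach() for h in hidden)
    else:
        return hidden.detach()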

def train_epoch(model, dataset, loss_fn, optimizer, device, epoch, verbosity):
    """Train one epoch of a network"""
    model.train()
    batch_loss = 0

    hidden = get_new_hidden(model)

    for j in range(dataset.chunk_size // dataset.seq_len):

        inputs, labels = dataset[j]

        optimizer.zero_grad()
        hidden = detach_hidden(hidden)

        outputs, hidden = model(inputs, hidden)
        if model.rnn_type == 'gru':
            hidden = hidden.unsqueeze(0)

        loss = loss_fn(outputs.view(-1, outputs.shape[-1]), labels.view(-1))
        loss.backward()
        optimizer.step()

        batch_loss += loss.item()
        if (j + 1) % verbosity == 0:
            print(f'Batch #{j + 1} Loss: {batch_loss / verbosity}')
            batch_loss = 0

def perplexity(loss, batches):
    return math.exp(loss / batches)

def evaluate_model(title, model, dataset, loss_fn, seq_len, batch_size, epoch):
    model.eval()
    total_loss = 0
    num_batches = len(dataset) // (batch_size * seq_len)

    hidden = get_new_hidden(model)

    with torch.no_grad():
        for j in range(num_batches):

            inputs, labels = dataset[j]

            outputs, hidden = model(inputs, hidden)
            if model.rnn_type == 'gru':
                hidden = hidden.unsqueeze(0)
            loss = loss_fn(outputs.view(-1, outputs.shape[-1]), labels.view(-1))
            total_loss += loss.item()

    perp = perplexity(total_loss, num_batches)
    wandb.log({
            f'{title}-loss': total_loss / num_batches,
            f'{title}-perplexity': perp
        }, step=epoch)

    print(f'\033[92m{title} perplexity: {perp:.6f} ||| loss {total_loss / num_batches:.6f}\033[0m')

    return perp

def train_network(model, datasets, loss_fn, optimizer, schedule, device, epochs: int, verbosity: int):
    for epoch in range(epochs):
        lr = optimizer.param_groups[0]['lr']

        print(f'----------- Epoch #{epoch + 1}, LR: {lr} ------------')
        train_epoch(model, datasets['train'], loss_fn, optimizer, device, epoch, verbosity)
        train_perplexity = evaluate_model('Train', model, datasets['train'], loss_fn, datasets['train'].seq_len, datasets['train'].batch_size, epoch)
        val_perplexity = evaluate_model('Validation', model, datasets['val'], loss_fn, datasets['train'].seq_len, datasets['train'].batch_size, epoch)
        test_perplexity = evaluate_model('Test', model, datasets['test'], loss_fn, datasets['train'].seq_len, datasets['train'].batch_size, epoch)
        print('------------------------------------\n')

        schedule.step()
    print('----------- Train Complete! ------------')
    return {
        'train':train_perplexity,
        'val':val_perplexity,
        'test':test_perplexity
    }
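
The `perplexity` helper above exponentiates the mean cross-entropy. Because every batch holds the same number of tokens (`batch_size * seq_len`), averaging the per-batch mean losses gives the per-token mean. The short sketch below, with the assumed helper name `perplexity_from_mean_loss`, shows the relationship and a useful baseline: a model guessing uniformly over the vocabulary has perplexity equal to the vocabulary size.

```python
import math


def perplexity_from_mean_loss(mean_loss):
    """Perplexity is exp(mean cross-entropy, measured in nats per token)."""
    return math.exp(mean_loss)


vocab_size = 10003  # vocabulary size reported earlier in the notebook
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess
print(perplexity_from_mean_loss(uniform_loss))  # approximately 10003
```

Any trained model should score well below this baseline; a loss of 0 corresponds to a perplexity of exactly 1.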

## Train Models

### LSTM No Regularization

In [10]:
decay_start = 10
learning_rate_decay = 0.5
lr = 4
dropout_rate = 0

def lr_lambda(epoch):
    if epoch < decay_start:
        return 1
    else:
        return learning_rate_decay ** (epoch - (decay_start-1))

model = ZamrembaRNN('lstm', len(train.vocab)).to(device)
sgd = optim.SGD(model.parameters(), lr=lr)
cross_entropy = nn.CrossEntropyLoss()
schedule = optim.lr_scheduler.LambdaLR(sgd, lr_lambda)


run = wandb.init(project="dl-assignment2-quad", config={
    'batch_size':datasets['train'].batch_size,
    'embedding_size':model.embedding_dim,
    'hidden_units':model.hidden_dim,
    'num_lstm_layers':model.num_layers,
    'dropout_rate':dropout_rate,
    'decay_at':decay_start,
    'learning_rate_decay':learning_rate_decay,
    'learning_rate_start':lr,
    'optimizer':'SGD',
    'seq_len':datasets['train'].seq_len,
    'rnn_type':model.rnn_type
})
final_metrics = train_network(model, datasets, cross_entropy, sgd, schedule, device, 14, 500)
run.finish()

| Metric | History |
|---|---|
| Test-loss | █▄▄▄▄▅▅▅▂▆▁▁ |
| Test-perplexity | █▁▁▁▁▁▁▁▁▁▁▁ |
| Train-loss | █▄▄▄▄▅▅▅▂▆▁▁ |
| Train-perplexity | █▁▁▁▁▁▁▁▁▁▁▁ |
| Validation-loss | █▅▄▄▄▅▅▅▂▆▁▁ |
| Validation-perplexity | █▁▁▁▁▁▁▁▁▁▁▁ |

| Metric | Value |
|---|---|
| Test-loss | 6.47588 |
| Test-perplexity | 649.29014 |
| Train-loss | 6.52741 |
| Train-perplexity | 683.62764 |
| Validation-loss | 6.54278 |
| Validation-perplexity | 694.21434 |


----------- Epoch #1, LR: 4 ------------
Batch #500 Loss: 6.814496244430542
Batch #1000 Loss: 6.190024091720581
Batch #1500 Loss: 5.95184453201294
Batch #2000 Loss: 5.80175147819519
Train perplexity: 288.813738 ||| loss 5.665782
Validation perplexity: 294.080869 ||| loss 5.683855
Test perplexity: 287.957971 ||| loss 5.662815
------------------------------------

----------- Epoch #2, LR: 4 ------------
Batch #500 Loss: 5.615063290596009
Batch #1000 Loss: 5.535354831695557
Batch #1500 Loss: 5.4458003463745115
Batch #2000 Loss: 5.38455379486084
Train perplexity: 203.791267 ||| loss 5.317096
Validation perplexity: 220.461390 ||| loss 5.395723
Test perplexity: 214.807699 ||| loss 5.369743
------------------------------------

----------- Epoch #3, LR: 4 ------------
Batch #500 Loss: 5.295789780616761
Batch #1000 Loss: 5.249943510055542
Batch #1500 Loss: 5.1965356426239016
Batch #2000 Loss: 5.156316002845764
Train perplexity: 166.98

| Metric | History |
|---|---|
| Test-loss | █▆▄▃▃▂▂▂▂▂▁▁▁▁ |
| Test-perplexity | █▅▄▃▂▂▂▂▁▁▁▁▁▁ |
| Train-loss | █▆▆▅▄▄▃▃▃▂▂▁▁▁ |
| Train-perplexity | █▅▄▃▃▃▂▂▂▂▁▁▁▁ |
| Validation-loss | █▆▄▃▃▂▂▂▂▁▁▁▁▁ |
| Validation-perplexity | █▅▄▃▂▂▂▁▁▁▁▁▁▁ |

| Metric | Value |
|---|---|
| Test-loss | 4.8057 |
| Test-perplexity | 122.20443 |
| Train-loss | 4.11458 |
| Train-perplexity | 61.22637 |
| Validation-loss | 4.83841 |
| Validation-perplexity | 126.26872 |
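
The learning-rate schedule for this run can be tabulated without PyTorch. The sketch below re-implements `lr_lambda` as a pure function (the name `lr_at_epoch` is an assumption for illustration), using the values from the cell above: a base rate of 4 and a decay factor of 0.5 that starts at epoch index 10, over 14 epochs.

```python
def lr_at_epoch(epoch, base_lr=4.0, decay=0.5, decay_start=10):
    """Learning rate for a 0-based epoch index, following `lr_lambda` above."""
    if epoch < decay_start:
        return base_lr
    return base_lr * decay ** (epoch - (decay_start - 1))


schedule = [lr_at_epoch(e) for e in range(14)]
print(schedule[9:])  # [4.0, 2.0, 1.0, 0.5, 0.25]
```

The rate stays at 4 for the first ten epochs and is halved before each of the remaining four, so the last epoch trains at 0.25.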


### LSTM with Dropout

In [8]:
decay_start = 11
learning_rate_decay = 0.75
lr = 6
dropout_rate = 0.5
lstm_dropout = 0.2

def lr_lambda(epoch):
    if epoch < decay_start:
        return 1
    else:
        return learning_rate_decay ** (epoch - (decay_start-1))

model = ZamrembaRNN('lstm', len(train.vocab), dropout=dropout_rate, rnn_dropout=lstm_dropout).to(device)
sgd = optim.SGD(model.parameters(), lr=lr)
cross_entropy = nn.CrossEntropyLoss()
schedule = optim.lr_scheduler.LambdaLR(sgd, lr_lambda)


run = wandb.init(project="dl-assignment2-tri", config={
    'batch_size':datasets['train'].batch_size,
    'embedding_size':model.embedding_dim,
    'hidden_units':model.hidden_dim,
    'num_lstm_layers':model.num_layers,
    'dropout_rate':dropout_rate,
    'lstm_dropout':lstm_dropout,
    'decay_at':decay_start,
    'learning_rate_decay':learning_rate_decay,
    'learning_rate_start':lr,
    'optimizer':'SGD',
    'seq_len':datasets['train'].seq_len,
    'rnn_type':model.rnn_type
})
final_metrics = train_network(model, datasets, cross_entropy, sgd, schedule, device, 25, 500)
run.finish()

----------- Epoch #1, LR: 6 ------------
Batch #500 Loss: 6.763347356796265
Batch #1000 Loss: 6.229209800720215
Batch #1500 Loss: 5.98650328540802
Batch #2000 Loss: 5.827206930160522
Train perplexity: 265.904277 ||| loss 5.583136
Validation perplexity: 271.359342 ||| loss 5.603444
Test perplexity: 266.137427 ||| loss 5.584013
------------------------------------

----------- Epoch #2, LR: 6 ------------
Batch #500 Loss: 5.658116242408752
Batch #1000 Loss: 5.586261615753174
Batch #1500 Loss: 5.505580702781677
Batch #2000 Loss: 5.445070253372192
Train perplexity: 183.753716 ||| loss 5.213596
Validation perplexity: 197.874569 ||| loss 5.287633
Test perplexity: 192.846486 ||| loss 5.261894
------------------------------------

----------- Epoch #3, LR: 6 ------------
Batch #500 Loss: 5.376493459701538
Batch #1000 Loss: 5.333149443626404
Batch #1500 Loss: 5.2882169313430785
Batch #2000 Loss: 5.256061985969543
Train perplexity: 150.6

| Metric | History |
|---|---|
| Test-loss | █▆▅▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ |
| Test-perplexity | █▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| Train-loss | █▆▅▅▄▄▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ |
| Train-perplexity | █▅▄▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| Validation-loss | █▆▅▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ |
| Validation-perplexity | █▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |

| Metric | Value |
|---|---|
| Test-loss | 4.57084 |
| Test-perplexity | 96.62485 |
| Train-loss | 4.14693 |
| Train-perplexity | 63.23966 |
| Validation-loss | 4.60716 |
| Validation-perplexity | 100.19931 |


### GRU No Regularization

In [12]:
decay_start = 6
learning_rate_decay = 0.5
lr = 1
dropout_rate = 0

def lr_lambda(epoch):
    if epoch < decay_start:
        return 1
    else:
        return learning_rate_decay ** (epoch - (decay_start-1))

model = ZamrembaRNN('gru', len(train.vocab)).to(device)
sgd = optim.SGD(model.parameters(), lr=lr)
cross_entropy = nn.CrossEntropyLoss()
schedule = optim.lr_scheduler.LambdaLR(sgd, lr_lambda)


run = wandb.init(project="dl-assignment2-tri", config={
    'batch_size':datasets['train'].batch_size,
    'embedding_size':model.embedding_dim,
    'hidden_units':model.hidden_dim,
    'num_lstm_layers':model.num_layers,
    'dropout_rate':dropout_rate,
    'decay_at':decay_start,
    'learning_rate_decay':learning_rate_decay,
    'learning_rate_start':lr,
    'optimizer':'SGD',
    'seq_len':datasets['train'].seq_len,
    'rnn_type':model.rnn_type
})
final_metrics = train_network(model, datasets, cross_entropy, sgd, schedule, device, 14, 500)
run.finish()


----------- Epoch #1, LR: 1 ------------
Batch #500 Loss: 6.739433595657348
Batch #1000 Loss: 6.212292663574218
Batch #1500 Loss: 5.9382801446914675
Batch #2000 Loss: 5.768225525856018
Train perplexity: 276.808866 ||| loss 5.623327
Validation perplexity: 284.441684 ||| loss 5.650528
Test perplexity: 277.617216 ||| loss 5.626243
------------------------------------

----------- Epoch #2, LR: 1 ------------
Batch #500 Loss: 5.572338928222656
Batch #1000 Loss: 5.485908900260926
Batch #1500 Loss: 5.387465888977051
Batch #2000 Loss: 5.321869004249573
Train perplexity: 192.254633 ||| loss 5.258821
Validation perplexity: 211.627044 ||| loss 5.354825
Test perplexity: 206.307696 ||| loss 5.329369
------------------------------------

----------- Epoch #3, LR: 1 ------------
Batch #500 Loss: 5.230482802391053
Batch #1000 Loss: 5.177635939598083
Batch #1500 Loss: 5.1146213245391845
Batch #2000 Loss: 5.074763534545898
Train perplexity: 151

| Metric | History |
|---|---|
| Test-loss | █▆▄▃▃▂▂▁▁▁▁▁▁▁ |
| Test-perplexity | █▅▃▃▂▂▁▁▁▁▁▁▁▁ |
| Train-loss | █▆▅▄▃▂▂▁▁▁▁▁▁▁ |
| Train-perplexity | █▅▄▃▂▂▁▁▁▁▁▁▁▁ |
| Validation-loss | █▆▄▃▃▂▂▁▁▁▁▁▁▁ |
| Validation-perplexity | █▅▃▃▂▂▁▁▁▁▁▁▁▁ |

| Metric | Value |
|---|---|
| Test-loss | 4.77561 |
| Test-perplexity | 118.58254 |
| Train-loss | 4.33025 |
| Train-perplexity | 75.96307 |
| Validation-loss | 4.80447 |
| Validation-perplexity | 122.0545 |


### GRU with Dropout

In [11]:
decay_start = 20
learning_rate_decay = 0.75
lr = 1
dropout_rate = 0.5
gru_dropout = 0.2

def lr_lambda(epoch):
    if epoch < decay_start:
        return 1
    else:
        return learning_rate_decay ** (epoch - (decay_start-1))

model = ZamrembaRNN('gru', len(train.vocab), dropout=dropout_rate, rnn_dropout=gru_dropout).to(device)
sgd = optim.SGD(model.parameters(), lr=lr)
cross_entropy = nn.CrossEntropyLoss()
schedule = optim.lr_scheduler.LambdaLR(sgd, lr_lambda)


run = wandb.init(project="dl-assignment2-tri", config={
    'batch_size':datasets['train'].batch_size,
    'embedding_size':model.embedding_dim,
    'hidden_units':model.hidden_dim,
    'num_lstm_layers':model.num_layers,
    'dropout_rate':dropout_rate,
    'lstm_dropout':gru_dropout,
    'decay_at':decay_start,
    'learning_rate_decay':learning_rate_decay,
    'learning_rate_start':lr,
    'optimizer':'SGD',
    'seq_len':datasets['train'].seq_len,
    'rnn_type':model.rnn_type
})
final_metrics = train_network(model, datasets, cross_entropy, sgd, schedule, device, 25, 500)
run.finish()

----------- Epoch #1, LR: 1 ------------
Batch #500 Loss: 6.801609323501587
Batch #1000 Loss: 6.368978764533996
Batch #1500 Loss: 6.138936006546021
Batch #2000 Loss: 6.0047424793243405
Train perplexity: 321.178028 ||| loss 5.771996
Validation perplexity: 322.832683 ||| loss 5.777134
Test perplexity: 314.067245 ||| loss 5.749607
------------------------------------

----------- Epoch #2, LR: 1 ------------
Batch #500 Loss: 5.846682530403137
Batch #1000 Loss: 5.781919544219971
Batch #1500 Loss: 5.70266277217865
Batch #2000 Loss: 5.653602899551392
Train perplexity: 235.868469 ||| loss 5.463274
Validation perplexity: 247.050445 ||| loss 5.509593
Test perplexity: 240.137051 ||| loss 5.481210
------------------------------------

----------- Epoch #3, LR: 1 ------------
Batch #500 Loss: 5.588469083786011
Batch #1000 Loss: 5.556520874977112
Batch #1500 Loss: 5.505293214797974
Batch #2000 Loss: 5.47390731716156
Train perplexity: 193.01

| Metric | History |
|---|---|
| Test-loss | █▆▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁ |
| Test-perplexity | █▆▄▄▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁ |
| Train-loss | █▇▆▅▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁ |
| Train-perplexity | █▆▄▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁ |
| Validation-loss | █▆▅▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁ |
| Validation-perplexity | █▆▄▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ |

| Metric | Value |
|---|---|
| Test-loss | 4.65326 |
| Test-perplexity | 104.92682 |
| Train-loss | 4.2028 |
| Train-perplexity | 66.87345 |
| Validation-loss | 4.68715 |
| Validation-perplexity | 108.54369 |
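
The final metrics of the four runs can be collected for a side-by-side comparison. The sketch below copies the final perplexities logged above into a dictionary and ranks the models by validation perplexity; the variable names are illustrative and not part of the notebook.

```python
# Final perplexities copied from the run summaries above: (train, validation, test)
final_perplexity = {
    'LSTM, no regularization': (61.22637, 126.26872, 122.20443),
    'LSTM with dropout': (63.23966, 100.19931, 96.62485),
    'GRU, no regularization': (75.96307, 122.0545, 118.58254),
    'GRU with dropout': (66.87345, 108.54369, 104.92682),
}

# Rank models by validation perplexity (lower is better)
ranking = sorted(final_perplexity, key=lambda name: final_perplexity[name][1])
for name in ranking:
    train_ppl, val_ppl, test_ppl = final_perplexity[name]
    print(f'{name:<26} train {train_ppl:7.2f} | val {val_ppl:7.2f} | test {test_ppl:7.2f}')
```

In these runs, dropout lowers validation and test perplexity for both architectures. The dropout runs also trained for 25 epochs rather than 14, so the comparison is not fully controlled.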
