<a href="https://colab.research.google.com/github/mikkelbrusen/custom-pytorch-lstm-lm/blob/master/Language_Model_Using_a_Custom_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Model Using a Custom-Built LSTM
This notebook contains the code to implement a language model using a custom LSTM built within the PyTorch framework.

This model was implemented as part of a project in the course 02456 Deep Learning @ DTU - Technical University of Denmark.
+ This code was originally forked from the [PyTorch word-level language modeling example](https://github.com/pytorch/examples/tree/master/word_language_model).
+ The code in this notebook is available on [Google Colab](https://colab.research.google.com/drive/1luim4qegwBeKVAAzW-XarPSCOhogiBQf) and on [GitHub](https://github.com/mikkelbrusen/custom-pytorch-lstm-lm).

The model comes with instructions to train a word-level language model over the Penn Treebank (PTB).

The project was carried out by [Gustav Madslund](https://github.com/gustavmadslund) and [Mikkel Møller Brusen](https://github.com/mikkelbrusen).

# Setup
This section contains all the necessary setup, such as hyperparameters, data processing, and utility functions.

## Google Colab Setup
Since we are running on Google Colab, we need to install PyTorch ourselves, as Colab only supports TensorFlow by default, because, well, they are Google and not Facebook.

In [0]:
# http://pytorch.org/
from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.1-{platform}-linux_x86_64.whl torchvision
import torch

We will need some data to train on, and a place to save our model.
We connect to Google Drive and place our data in the following path: *MyDrive/NLP/data/penn/*

In [0]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

## Imports and params





In [0]:
import argparse
import time
import math
import os
import torch
import torch.nn as nn
from torch.autograd import Variable
from collections import Counter

In [0]:
args_cuda = torch.cuda.is_available()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
args_train_batch_size = 20 # batch size
args_bptt = 35 # sequence length
args_embed_size = 450 # embedding dimension
args_hidden_size = 650 # hidden units per LSTM layer
args_num_layers = 2 # Number of LSTM layers
args_num_epochs = 40
args_learning_rate = 20
args_dropout = 0.5
args_clip = 0.25
args_log_interval = 100
args_seed = 1111 # We seed the RNGs for reproducibility

# If you don't already have the Penn Treebank data, grab it from our GitHub repo
# here: https://github.com/mikkelbrusen/custom-pytorch-lstm-lm
args_data = "/content/gdrive/My Drive/NLP/data/penn/"

# The file in which we want to save our trained model.
args_save = "/content/gdrive/My Drive/NLP/save/Custom_LSTM_Model.pt"


torch.manual_seed(args_seed)
if args_cuda:
  torch.cuda.manual_seed(args_seed)

## The data loader
Dictionary and corpus to process the dataset

In [0]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []
        self.counter = Counter()
        self.total = 0

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        token_id = self.word2idx[word]
        self.counter[token_id] += 1
        self.total += 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r') as f:
            tokens = 0
            for line in f:
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r') as f:
            ids = torch.LongTensor(tokens)
            token = 0
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1

        return ids
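
The mapping that `Dictionary` and `Corpus.tokenize` build can be illustrated without PyTorch. The sketch below is a simplified stand-in, not the notebook's classes: `build_vocab` is a hypothetical helper, and the two input sentences are made up. It assigns ids in order of first appearance and appends `<eos>` to every line, as the tokenizer above does.

```python
def build_vocab(lines):
    """Map each distinct token to an integer id, in order of first appearance."""
    word2idx = {}
    idx2word = []
    ids = []
    for line in lines:
        # Mirror the notebook's tokenizer: whitespace split plus an end-of-sentence marker.
        for word in line.split() + ['<eos>']:
            if word not in word2idx:
                word2idx[word] = len(idx2word)
                idx2word.append(word)
            ids.append(word2idx[word])
    return word2idx, idx2word, ids


# Two made-up sentences; repeated words reuse their existing id.
word2idx, idx2word, ids = build_vocab(["the cat sat", "the dog sat"])
print(ids)  # [0, 1, 2, 3, 0, 4, 2, 3]
```

The resulting id sequence is what the notebook stores in a `torch.LongTensor` for each of the train, validation, and test files.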

## Utils
Utility functions which will be used while training, validating and testing

In [0]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)


# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e.g. 'g' on 'f' cannot be learned, but allows more efficient
# batch processing.

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)


# get_batch subdivides the source data into chunks of length args_bptt.
# Here i is the (exclusive) end row of the chunk, so the loops start at
# i = args_bptt. If source is equal to the example output of the batchify
# function, with a bptt-limit of 2, we'd get the following two tensors for
# i = 2 (the target is additionally flattened to one dimension):
# ┌ a g m s ┐ ┌ b h n t ┐
# └ b h n t ┘ └ c i o u ┘
# Note that despite the name of the function, the subdivision of data is not
# done along the batch dimension (i.e. dimension 1), since that was handled
# by the batchify function. The chunks are along dimension 0, corresponding
# to the seq_len dimension in the LSTM.

def get_batch(source, i):
    #seq_len = min(args_bptt, len(source) - 1 - i)
    data = source[i-args_bptt:i]
    target = source[i+1-args_bptt:i+1].view(-1)
    return data, target
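
To make the alphabet example from the comments concrete, here is a minimal plain-Python sketch of the same reshaping. It uses nested lists instead of tensors, and `batchify_list` and `get_batch_list` are hypothetical names introduced only for this illustration; the indexing follows `batchify` and `get_batch` above.

```python
import string


def batchify_list(seq, bsz):
    """Arrange a flat sequence into rows of width bsz, one column per stream."""
    nbatch = len(seq) // bsz
    seq = seq[:nbatch * bsz]  # drop the remainder that does not fit
    columns = [seq[k * nbatch:(k + 1) * nbatch] for k in range(bsz)]
    # Row t holds the t-th element of every column.
    return [[col[t] for col in columns] for t in range(nbatch)]


def get_batch_list(source, i, bptt):
    """Return the bptt rows ending just before row i, and targets shifted by one row."""
    data = source[i - bptt:i]
    target = [tok for row in source[i + 1 - bptt:i + 1] for tok in row]
    return data, target


rows = batchify_list(list(string.ascii_lowercase), 4)
data, target = get_batch_list(rows, 2, 2)
print(rows[0])  # ['a', 'g', 'm', 's']
print(data)     # [['a', 'g', 'm', 's'], ['b', 'h', 'n', 't']]
print(target)   # ['b', 'h', 'n', 't', 'c', 'i', 'o', 'u']
```

Each target is the token one row below its input in the same column, which is why the letters `y` and `z` (the trimmed remainder) never appear.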

## Process data
Load the dataset and make train, validation and test sets

In [0]:
corpus = Corpus(args_data)

eval_batch_size = 10
test_batch_size = 10
train_data = batchify(corpus.train, args_train_batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, test_batch_size)

# Custom LSTM
We chose to implement our LSTM modules as single-layer modules, meaning that the multiple layers are created within our model rather than within the LSTM module.

### Dimensions
An analysis of the dimensions can be found in the following figures

**Whole Model**

https://www.lucidchart.com/invitations/accept/b77330fa-348f-47e4-b0c1-ac13d3b72a81

**LSTM**

https://www.lucidchart.com/invitations/accept/6a965c89-0e68-40b6-b3c6-bb4d0b2e6619

In [0]:
#LSTM Module
class LSTM(nn.Module):
  def __init__(self,input_size,hidden_size,bias=False):
    super(LSTM, self).__init__()
    self.input_size = input_size
    self.hidden_size = hidden_size
    
    self.weight_fx = nn.Linear(input_size, hidden_size, bias=bias)
    self.weight_ix = nn.Linear(input_size, hidden_size, bias=bias)
    self.weight_cx = nn.Linear(input_size, hidden_size, bias=bias)
    self.weight_ox = nn.Linear(input_size, hidden_size, bias=bias)
    
    self.weight_fh = nn.Linear(hidden_size, hidden_size, bias=bias)
    self.weight_ih = nn.Linear(hidden_size, hidden_size, bias=bias)
    self.weight_ch = nn.Linear(hidden_size, hidden_size, bias=bias)
    self.weight_oh = nn.Linear(hidden_size, hidden_size, bias=bias)
    
    
  def forward(self,input, hidden):
    h,c = hidden
    def recurrence(inp, hidden):
      """Recurrence helper."""
      h,c = hidden
      
      f_g = torch.sigmoid(self.weight_fx(inp) + self.weight_fh(h))
      i_g = torch.sigmoid(self.weight_ix(inp) + self.weight_ih(h))
      o_g = torch.sigmoid(self.weight_ox(inp) + self.weight_oh(h))
      c_tilda = torch.tanh(self.weight_cx(inp) + self.weight_ch(h))
      c_t = f_g * c + i_g * c_tilda
      h_t = o_g * torch.tanh(c_t)
    
      return h_t, c_t
      #--------------
    
    output = []
    for inp in input:
      h,c = recurrence(inp, (h,c))
      output.append(h)

    # torch.cat(output, 0).size()=torch.Size([700, 650]) view(input.size(0)=35, *output[0].size()=20 650)
    output = torch.cat(output, 0).view(input.size(0), *output[0].size())
    return output, (h,c)
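
For reference, the `recurrence` helper above corresponds to the following equations, read directly off the code. The symbol names are chosen here for illustration: $x_t$ is the input at time step $t$, $W_{fx}$ stands for `weight_fx` (and likewise for the other weights), and $\odot$ is element-wise multiplication. There are no bias terms because the module defaults to `bias=False`.

```latex
\begin{aligned}
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1}) \\
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1}) \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1}) \\
\tilde{c}_t &= \tanh(W_{cx} x_t + W_{ch} h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```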
      

# Language Model using our Custom LSTM
First we define our model

In [0]:
class LSTMModel(nn.Module):
    def __init__(self, num_tokens, embed_size, hidden_size, output_size, dropout=0.5, n_layers=1):
        super(LSTMModel, self).__init__()
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(num_tokens, embed_size)
        
        # We add each LSTM layer to the module list such that PyTorch is aware
        # of their parameters when we perform gradient descent
        self.layers = nn.ModuleList()
        for l in range(n_layers):
          layer_input_size = embed_size if l == 0 else hidden_size
          self.layers.append(LSTM(layer_input_size, hidden_size))
          
        self.decoder = nn.Linear(hidden_size, output_size)
        
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        
        self.init_weights()
       

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, inp, hidden):
        emb = self.drop(self.encoder(inp))
        
        output= emb
        h_0, c_0 = hidden
        h, c = [], []
        
        # Iterate over each LSTM layer, and pass the output from one layer on to the next 
        for i, layer in enumerate(self.layers): 
            output, (h_i, c_i) = layer(output, (h_0[i], c_0[i]))
            output = self.drop(output)
            
            h += [h_i]
            c += [c_i]
        
        h = torch.stack(h)
        c = torch.stack(c)
 
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        decoded = decoded.view(output.size(0), output.size(1), decoded.size(1))
    
        return decoded, (h,c)

    def init_hidden(self,bsz):
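        # Allocate the initial states on the selected device instead of
        # hard-coding .cuda(), so the model also runs on CPU
        h_0 = torch.zeros(self.n_layers, bsz, self.hidden_size, device=device)
        c_0 = torch.zeros(self.n_layers, bsz, self.hidden_size, device=device)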
        return (h_0, c_0)
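
As a rough sanity check on the model size, the parameter count can be worked out by hand from the layer definitions above. This is a sketch under one loudly stated assumption: a vocabulary of 10,000 words, which is the usual size for the preprocessed PTB data; in the notebook the real value comes from `len(corpus.dictionary)`. The helper names are hypothetical.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with one input-to-hidden and one hidden-to-hidden
    # matrix; no biases, matching the default bias=False of the custom LSTM.
    return 4 * (input_size * hidden_size + hidden_size * hidden_size)


def model_params(num_tokens, embed_size, hidden_size, n_layers):
    total = num_tokens * embed_size  # embedding table
    for layer in range(n_layers):
        in_size = embed_size if layer == 0 else hidden_size
        total += lstm_layer_params(in_size, hidden_size)
    total += hidden_size * num_tokens + num_tokens  # decoder weight and bias
    return total


# Assumed vocabulary size of 10000; other values are the notebook's hyper-parameters.
print(model_params(10000, 450, 650, 2))  # 17250000
```

Under that assumption, the embedding and decoder together account for roughly 11 million of the 17.25 million parameters, so most of the model's capacity sits at the vocabulary interfaces rather than in the recurrent layers.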


Then we build the model and specify our loss function

In [0]:
ntokens = len(corpus.dictionary)

model = LSTMModel(ntokens, args_embed_size, args_hidden_size, ntokens, args_dropout, args_num_layers).to(device)
criterion = nn.CrossEntropyLoss()

# Train model

First we define our training and evaluation

In [0]:
def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    total_words = len(data_source) - (len(data_source) % args_bptt)
    ntokens = len(corpus.dictionary)
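    # Size the hidden state from the data itself, so validation and test
    # sets may use different batch sizes
    hidden = model.init_hidden(data_source.size(1))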
    with torch.no_grad():
        for i in range(args_bptt, data_source.size(0) - 1, args_bptt):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
            hidden = repackage_hidden(hidden)
    return total_loss / (total_words  - 1)


def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(args_train_batch_size)
    for batch, i in enumerate(range(args_bptt, train_data.size(0) - 1, args_bptt)):
        data, targets = get_batch(train_data, i)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        hidden = repackage_hidden(hidden)
        model.zero_grad()
        output, hidden = model(data, hidden)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()

        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), args_clip)
        for p in model.parameters():
            p.data.add_(-lr, p.grad.data)

        total_loss += loss.item()

        if batch % args_log_interval == 0 and batch > 0:
            cur_loss = total_loss / args_log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // args_bptt, lr,
                elapsed * 1000 / args_log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()
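
The two steps after `loss.backward()` are clipping the gradient norm and then applying a plain SGD update. The sketch below illustrates the idea on a flat list of scalar gradients. It is a simplification and not PyTorch's implementation: the helper names are hypothetical, and the small `eps` constant is an assumption added to avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one common factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm


def sgd_step(params, grads, lr):
    """Plain SGD: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]


# Made-up gradients with norm 5, clipped to the notebook's args_clip = 0.25.
grads, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.25)
params = sgd_step([1.0, 1.0], grads, lr=20)
print(norm_before, grads, params)
```

With a clip value of 0.25 and a learning rate of 20, a single update can move the parameters by at most about 20 × 0.25 = 5 in L2 norm, which is what keeps such a large learning rate workable.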



Then do the actual training

In [0]:
# Loop over epochs.
lr = args_learning_rate
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, args_num_epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(args_save, 'wb') as f:
                torch.save(model, f)
            
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')



| epoch   1 |   100/ 1327 batches | lr 20.00 | ms/batch 156.54 | loss  7.08 | ppl  1189.33
| epoch   1 |   200/ 1327 batches | lr 20.00 | ms/batch 153.69 | loss  6.26 | ppl   523.80
| epoch   1 |   300/ 1327 batches | lr 20.00 | ms/batch 154.92 | loss  6.02 | ppl   411.92
| epoch   1 |   400/ 1327 batches | lr 20.00 | ms/batch 154.25 | loss  5.85 | ppl   345.59
| epoch   1 |   500/ 1327 batches | lr 20.00 | ms/batch 154.90 | loss  5.76 | ppl   317.08
| epoch   1 |   600/ 1327 batches | lr 20.00 | ms/batch 152.67 | loss  5.70 | ppl   299.46
| epoch   1 |   700/ 1327 batches | lr 20.00 | ms/batch 154.75 | loss  5.59 | ppl   268.20
| epoch   1 |   800/ 1327 batches | lr 20.00 | ms/batch 155.21 | loss  5.47 | ppl   237.68
| epoch   1 |   900/ 1327 batches | lr 20.00 | ms/batch 155.04 | loss  5.45 | ppl   233.12
| epoch   1 |  1000/ 1327 batches | lr 20.00 | ms/batch 154.62 | loss  5.45 | ppl   232.37
| epoch   1 |  1100/ 1327 batches | lr 20.00 | ms/batch 153.48 | loss  5.32 | ppl   204.66


| epoch   2 |   100/ 1327 batches | lr 20.00 | ms/batch 158.70 | loss  5.26 | ppl   192.67
| epoch   2 |   200/ 1327 batches | lr 20.00 | ms/batch 153.62 | loss  5.25 | ppl   190.80
| epoch   2 |   300/ 1327 batches | lr 20.00 | ms/batch 154.19 | loss  5.23 | ppl   187.53
| epoch   2 |   400/ 1327 batches | lr 20.00 | ms/batch 155.38 | loss  5.15 | ppl   171.61
| epoch   2 |   500/ 1327 batches | lr 20.00 | ms/batch 154.56 | loss  5.13 | ppl   168.96
| epoch   2 |   600/ 1327 batches | lr 20.00 | ms/batch 153.95 | loss  5.15 | ppl   173.14
| epoch   2 |   700/ 1327 batches | lr 20.00 | ms/batch 154.48 | loss  5.10 | ppl   163.83
| epoch   2 |   800/ 1327 batches | lr 20.00 | ms/batch 155.26 | loss  4.99 | ppl   147.66
| epoch   2 |   900/ 1327 batches | lr 20.00 | ms/batch 154.90 | loss  5.05 | ppl   155.49
| epoch   2 |  1000/ 1327 batches | lr 20.00 | ms/batch 154.54 | loss  5.06 | ppl   158.21
| epoch   2 |  1100/ 1327 batches | lr 20.00 | ms/batch 155.10 | loss  4.93 | ppl   137.86
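
The learning rate stays at 20 in the log above because it is only divided by 4 after an epoch whose validation loss fails to improve on the best one seen so far. The schedule can be sketched in isolation as follows; `lr_schedule` is a hypothetical helper and the validation losses are made up for illustration, not taken from a real run.

```python
def lr_schedule(val_losses, lr, factor=4.0):
    """Return the learning rate used in each epoch, given that epoch's validation loss."""
    best = None
    used = []
    for val_loss in val_losses:
        used.append(lr)
        if best is None or val_loss < best:
            best = val_loss  # new best: this is where the notebook saves the model
        else:
            lr /= factor     # no improvement: anneal the learning rate
    return used


# Hypothetical validation losses for four epochs.
print(lr_schedule([5.0, 4.8, 4.9, 4.7], lr=20))  # [20, 20, 20, 5.0]
```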

Finally, open the best saved model and run it on the test data.

In [0]:
# Load the best saved model.
with open(args_save, 'rb') as f:
    model = torch.load(f)

# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)


| End of training | test loss  4.44 | test ppl    85.00
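
The reported perplexity is the exponential of the average per-token cross-entropy loss, which is why the code prints `math.exp(test_loss)`. A minimal sketch of that relationship, using a hypothetical `perplexity` helper and made-up token probabilities:

```python
import math


def perplexity(token_probs):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that assigns probability 1/4 to every correct token has
# perplexity of about 4: it is as uncertain as a uniform 4-way choice.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Read this way, a test perplexity of 85 means the model is on average about as uncertain as if it were choosing uniformly among 85 words at each step.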


# Word generator

First define the arguments and load the corpus

In [0]:
# Load the best saved model.

with open(args_save, 'rb') as f:
    model = torch.load(f)

model.eval()

corpus = Corpus(args_data)
ntokens = len(corpus.dictionary)
hidden = model.init_hidden(1)



Then generate some data

In [0]:
args_words = 300
args_temperature = 0.9
input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)

words = []
probs = []

with torch.no_grad():  # no tracking history
    for i in range(args_words):
        output, hidden = model(input, hidden)
        
        word_weights = output.squeeze().div(args_temperature).exp().cpu()
        word_idx = torch.multinomial(word_weights, 1)[0]
        input.fill_(word_idx)
        word = corpus.dictionary.idx2word[word_idx]
        
        # We replace <unk> and <eos> with * to get a cleaner look, but that's just
        # personal preference
        if(word == "<unk>" or word == "<eos>"):
          word = "*"

        print(word + ('\n' if i % 20 == 19 else ' '),end='')
        
        # We also create arrays with the generated words and their probability 
        # to be used for visualizing them in a tool that we created for this
        # purpose
        words.append(word)
        probs.append(output.squeeze()[word_idx].data.tolist())

by the big board * the stock market basket gained N to N * the stock market reacted to the
lows of market sentiment about N N monday according to the european community commission * the insurers requested that many
* companies posted signs of earnings surge in the u.s. * the u.s. fed 's ratio was a more *
than $ N million in the first nine months of N said a * spokesman * for the first nine
months of this year the company added N N to N N in the third quarter compared with N million
the second quarter in N * but while imports rose N N in september * the company said the *
and * for the entire year fell about N N from july 's $ N billion * mr. * said
the levels on the thrift industry has been * by florida and the national aeronautics and space administration * his
order to cut off more reserves in private lending is particularly substantial and produce less than N N of the
wage increases a year ago * some * to the move which are the most * way along with its
farm inventories of seasonal
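
The generation loop divides the model's output scores by `args_temperature` before exponentiating and sampling. The effect of the temperature can be sketched in plain Python, with `random.choices` standing in for `torch.multinomial`. The helper names and the three-word score vector are hypothetical.

```python
import math
import random


def temperature_weights(logits, temperature):
    """Turn raw scores into unnormalized sampling weights exp(score / T)."""
    return [math.exp(x / temperature) for x in logits]


def sample_index(logits, temperature, rng=random):
    weights = temperature_weights(logits, temperature)
    # random.choices normalizes the weights internally before drawing.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-word vocabulary
for t in (0.5, 0.9, 2.0):
    w = temperature_weights(logits, t)
    print(t, [round(x / sum(w), 3) for x in w])
```

Lower temperatures concentrate probability on the highest-scoring word and give more conservative text, while higher temperatures flatten the distribution and give more varied but less coherent output.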

Print the words and probabilities for use in a [visualization tool](https://github.com/mikkelbrusen/text-weight-visualizer) we created.

In [0]:
print(words)
print(probs)

['by', 'the', 'big', 'board', '*', 'the', 'stock', 'market', 'basket', 'gained', 'N', 'to', 'N', '*', 'the', 'stock', 'market', 'reacted', 'to', 'the', 'lows', 'of', 'market', 'sentiment', 'about', 'N', 'N', 'monday', 'according', 'to', 'the', 'european', 'community', 'commission', '*', 'the', 'insurers', 'requested', 'that', 'many', '*', 'companies', 'posted', 'signs', 'of', 'earnings', 'surge', 'in', 'the', 'u.s.', '*', 'the', 'u.s.', 'fed', "'s", 'ratio', 'was', 'a', 'more', '*', 'than', '$', 'N', 'million', 'in', 'the', 'first', 'nine', 'months', 'of', 'N', 'said', 'a', '*', 'spokesman', '*', 'for', 'the', 'first', 'nine', 'months', 'of', 'this', 'year', 'the', 'company', 'added', 'N', 'N', 'to', 'N', 'N', 'in', 'the', 'third', 'quarter', 'compared', 'with', 'N', 'million', 'the', 'second', 'quarter', 'in', 'N', '*', 'but', 'while', 'imports', 'rose', 'N', 'N', 'in', 'september', '*', 'the', 'company', 'said', 'the', '*', 'and', '*', 'for', 'the', 'entire', 'year', 'fell', 'about',