<a href="https://colab.research.google.com/github/lverwimp/RNN_language_modeling/blob/master/rnn_lms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling with Recurrent Neural Networks

In this notebook, we will see how you can train a recurrent neural network language model.

We will start by importing TensorFlow, which is Google's open-source library for machine learning. Next, we will explain how to do data processing for language modeling and show you how we can train and test models.

## Importing TensorFlow and other requirements

We start by importing TensorFlow and checking if we are running on GPU:

In [0]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')

If the code above raised an error, you should make sure that you are using a GPU in the following way: select 'Runtime' in the top bar, then 'Change runtime type' and choose 'GPU' as hardware accelerator. Training neural networks is much faster on a GPU (graphics processing unit) than on a CPU.

Next, we do the other imports we need. The following code will allow you to upload files: you have to upload batchGenerator.py, rnn_lm.py and run_lm.py.

In [2]:
import numpy as np
import urllib, collections, os

# upload batchGenerator.py, rnn_lm.py, run_lm.py (all at once)
from google.colab import files
uploaded = files.upload()

Saving batchGenerator.py to batchGenerator.py
Saving rnn_lm.py to rnn_lm.py
Saving run_lm.py to run_lm.py


If the files are uploaded correctly, the following imports should succeed:

In [0]:
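# __future__ imports have to be the first statement in a cell
from __future__ import print_function
import rnn_lm, batchGenerator
# import the function run_lm from the module run_lm, so that we can call run_lm(...) directly
from run_lm import run_lm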

If the imports did not succeed, you should restart the runtime ('Runtime' in the top bar and then 'Restart runtime') and/or delete the files in the overview to the left (tab 'Files').

If you imported all libraries, you can now start the following section on data processing.

## Data processing

We will train our language models on the **Penn TreeBank**, which is a publicly available benchmark dataset. A benchmark dataset can be used to easily compare models, since everyone has access to the same data. Many published papers use the Penn TreeBank as a dataset.

It consists of, among others, newspaper articles, transcribed telephone conversations and manuals. The training set contains ca. 900,000 words, the validation set ca. 70,000 words and the test set ca. 80,000 words. This is a very small dataset (nowadays language models can be trained on billions of words), but it is large enough for our purposes.

We now download the training, validation and test data:

In [0]:
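train_url = 'http://homes.esat.kuleuven.be/~spchlab/H02A6/lab/session6/data/train.txt'
valid_url = 'http://homes.esat.kuleuven.be/~spchlab/H02A6/lab/session6/data/valid.txt'
test_url = 'http://homes.esat.kuleuven.be/~spchlab/H02A6/lab/session6/data/test.txt'
# urlopen lives in a different module in Python 2 and 3; decode the downloaded bytes to text
try:
  from urllib.request import urlopen
except ImportError:
  from urllib import urlopen
train_file = urlopen(train_url).read().decode('utf-8')
valid_file = urlopen(valid_url).read().decode('utf-8')
test_file = urlopen(test_url).read().decode('utf-8')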

The data looks like this:

In [6]:
print('{0}...'.format(valid_file[:500]))

 consumers may want to move their telephones a little closer to the tv set 
 <unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk> 
 two weeks ago viewers of several nbc <unk> consumer segments started calling a N number for advice on various <unk> issues 
 and the new syndicated reality show hard copy records viewers ' opinions for possible airing on the next day 's show 
 interactive telephone technology...


The data has been **normalized**: all words not in the vocabulary are mapped to an unknown-word class (<unk\>), all numbers are mapped to the class 'N', each line contains a single sentence, punctuation has been removed, and so on.

The purpose of normalization is, among other things, to get rid of information that is not necessary (such as punctuation), to resolve redundancies (for example, the same word can occur with different spellings, e.g. 'normalisation' and 'normalization', and we want to get rid of such variants) and to make sure that the language model generalizes better. An example of the latter is the mapping of all numbers to 'N': in 'in N years' from the example above, 'N' can correspond to any number. Assume that in our training data we see 'in 20 years' and 'in 11 years', and in our test data we see 'in 5 years'. If '20', '11' and '5' are not mapped to 'N', we have never seen 'in 5 years' before, and the probability estimate will be worse.
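
As a rough illustration of the normalization steps described above, here is a minimal sketch. It is not the procedure that was used to prepare the Penn TreeBank files: the function `normalize_sentence`, the punctuation list and the tiny `toy_vocabulary` are all hypothetical and only serve to show the idea.

```python
import re

def normalize_sentence(sentence, vocabulary):
    """Toy normalization: lowercase, drop punctuation, map numbers to 'N' and unknown words to '<unk>'."""
    tokens = []
    for token in sentence.lower().split():
        # strip punctuation at the edges of the token (hypothetical punctuation list)
        token = token.strip('.,;:!?"()')
        if not token:
            continue
        if re.match(r'^\d+([.,]\d+)*$', token):
            # every number is mapped to the same class
            tokens.append('N')
        elif token in vocabulary:
            tokens.append(token)
        else:
            # words outside the vocabulary are mapped to the unknown-word class
            tokens.append('<unk>')
    return ' '.join(tokens)

# hypothetical toy vocabulary, much smaller than the real one
toy_vocabulary = {'in', 'years', 'the', 'company', 'will'}
for raw in ['In 20 years, the company will expand.', 'In 5 years the company will merge!']:
    print(normalize_sentence(raw, toy_vocabulary))
```

Both toy sentences end up as the same normalized word sequence, which is exactly why a model trained on one of them can assign a reasonable probability to the other.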
  
We will now read the data, add end-of-sentence symbols (since we want to be able to predict the end of a sentence too), and count the frequency of every word in the training data:

In [0]:
# convert the string to a list and replace newlines with the end-of-sentence symbol <eos>
# ignore empty elements ''
train_text = [w for w in train_file.replace('\n',' <eos>').split(' ') if w != '']
valid_text = [w for w in valid_file.replace('\n',' <eos>').split(' ') if w != '']
test_text = [w for w in test_file.replace('\n',' <eos>').split(' ') if w != '']

# count the frequencies of the words in the training data
counter = collections.Counter(train_text)

# sort according to decreasing frequency
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))

We can take a look at the frequencies of the words in the training set, and compare them with the frequencies of the words in the validation set. The top 20 words are quite similar:

In [6]:
# count the frequencies of the words in the validation data
counter_valid = collections.Counter(valid_text)

# sort according to decreasing frequency
count_pairs_valid = sorted(counter_valid.items(), key=lambda x: (-x[1], x[0]))

print('Top 20 most frequent words:')
print('Train (freq.)\t\tValid (freq.)')
# we can take a look at the 20 most frequent words + their frequencies:
for i in range(20):
  print('{0} ({1})\t\t{2} ({3})'.format(count_pairs[i][0],count_pairs[i][1],count_pairs_valid[i][0],count_pairs_valid[i][1]))

Top 20 most frequent words:
Train (freq.)		Valid (freq.)
the (50770)		the (4122)
<unk> (45020)		<unk> (3485)
<eos> (42068)		<eos> (3370)
N (32481)		N (2603)
of (24400)		of (1832)
to (23638)		to (1750)
a (21196)		a (1738)
in (18000)		in (1392)
and (17474)		and (1391)
's (9784)		's (868)
that (8931)		for (726)
for (8927)		$ (659)
$ (7541)		that (657)
is (7337)		it (537)
it (6112)		is (529)
said (6027)		said (513)
on (5650)		on (486)
by (4915)		at (453)
at (4894)		was (436)
as (4833)		as (402)


Given that the training text is much larger than the validation text, it is normal that the absolute frequencies in the training text are much larger. The ranking of the words is more interesting, and we see that even in the top 20, there are small differences. For the medium- and low-frequency ranges, the differences will become larger:

In [7]:
print('Train (freq.)\t\tValid (freq.)')
for i in range(200,250):
  print('{0} ({1})\t\t{2} ({3})'.format(count_pairs[i][0],count_pairs[i][1],count_pairs_valid[i][0],count_pairs_valid[i][1]))

Train (freq.)		Valid (freq.)
well (462)		ended (40)
part (461)		revenue (40)
fell (459)		see (40)
japan (459)		several (40)
another (457)		days (39)
should (457)		get (39)
higher (453)		higher (39)
debt (452)		including (39)
offer (448)		black (38)
take (448)		close (38)
including (445)		firms (38)
among (444)		general (38)
court (444)		issues (38)
being (443)		well (38)
according (442)		around (37)
each (442)		chicago (37)
index (440)		concern (37)
tax (437)		drop (37)
trade (431)		high (37)
world (431)		might (37)
reported (430)		point (37)
work (426)		sale (37)
operations (424)		sold (37)
then (422)		american (36)
computer (420)		among (36)
past (420)		decline (36)
sale (419)		financial (36)
however (416)		international (36)
our (416)		management (36)
way (416)		monday (36)
lower (413)		plunge (36)
plans (412)		she (36)
vice (412)		small (36)
economic (410)		agreed (35)
department (409)		capital (35)
end (409)		late (35)
yield (409)		losses (35)
report (406)		made (35)
sold (402)		n

We now create a mapping from words to indices. The real input for the neural network will be indices, because they take up less space and because it makes certain operations easier.

In [0]:
# items = tuple of all the words (in decreasing frequency)
items, _ = list(zip(*count_pairs))

# make a dictionary with a mapping from each word to an id; word with highest frequency gets lowest id etc.
item_to_id = dict(zip(items, range(len(items))))
id_to_item = dict(zip(range(len(items)), items))
vocab_size = len(item_to_id)

# convert the words to indices
train_ids_large = [item_to_id[item] for item in train_text]
valid_ids_large = [item_to_id[item] for item in valid_text]
test_ids_large = [item_to_id[item] for item in test_text]

# take a smaller subset to speed up training
train_ids = train_ids_large[:50000]
valid_ids = valid_ids_large[:10000]
test_ids = test_ids_large[:10000]

Once the data is converted to ids, it looks like this:

In [9]:
print('Here is an example of words and their indices:')
for i in range(40):
  print('{0}\t{1}'.format(valid_text[i], valid_ids[i]))
print('\nAnd this is what the input looks like, a list of indices:')
print(valid_ids[:40])

Here is an example of words and their indices:
consumers	1132
may	93
want	358
to	5
move	329
their	51
telephones	9836
a	6
little	326
closer	2476
to	5
the	0
tv	662
set	388
<eos>	2
<unk>	1
<unk>	1
watching	2974
abc	2158
's	9
monday	381
night	1068
football	2347
can	89
now	99
vote	847
during	198
<unk>	1
for	11
the	0
greatest	3383
play	1119
in	7
N	3
years	72
from	20
among	211
four	346
or	36
five	258

And this is what the input looks like, a list of indices:
[1132, 93, 358, 5, 329, 51, 9836, 6, 326, 2476, 5, 0, 662, 388, 2, 1, 1, 2974, 2158, 9, 381, 1068, 2347, 89, 99, 847, 198, 1, 11, 0, 3383, 1119, 7, 3, 72, 20, 211, 346, 36, 258]


## Building, training and testing neural language models

We will now define the classes and functions that we will use for training and testing our language models.

The class for an RNN language model is **rnn_lm.rnn_lm()**. We will see later which options we can use.

**batchGenerator.batchGenerator(<dataset\>)** is a class that iterates over the data set and creates **mini-batches**, which are the input for the neural network. <dataset\> is a list of word ids.

A mini-batch contains several sentences/word sequences. Feeding mini-batches instead of a single sentence or a single word to the network speeds up the processing and also leads to better convergence of the model.

The batches are matrices of size **batch_size** x **num_steps**: batch_size is the number of different sequences in a single batch, and num_steps is the length of each sequence.

Here is an example of how batchGenerator can be used. You will notice that the target batch contains the same indices as the input batch, but shifted by one (time) step: the target at each position is the word that follows the input word at that position.

In [8]:
batch_size = 32
num_steps = 50

generator = batchGenerator.batchGenerator(valid_ids, batch_size=batch_size, num_steps=num_steps)
input_batch, target_batch, end_reached = generator.generate()
print('Shape of the mini-batch: {0}'.format(input_batch.shape))
print('This is what an input batch looks like:\n{0}'.format(input_batch))
print('And this is what a target batch looks like:\n{0}'.format(target_batch))

Shape of the mini-batch: (32, 50)
This is what an input batch looks like:
[[1132   93  358 ...    4  249 1795]
 [   4    3 3770 ...    2    0  361]
 [ 967   33   25 ...  769 2737    2]
 ...
 [  12    3   48 ... 1470    2   54]
 [ 505    7    1 ...  660   43  299]
 [   1 2034    8 ...   11   99   29]]
And this is what a target batch looks like:
[[  93  358    5 ...  249 1795    1]
 [   3 3770 1619 ...    0  361    4]
 [  33   25 2047 ... 2737    2 2158]
 ...
 [   3   48    7 ...    2   54 1068]
 [   7    1   50 ...   43  299 9642]
 [2034    8  377 ...   99   29   28]]
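
To make the relation between input and target batches more concrete, here is a minimal sketch of how such batches could be built with NumPy. It is not the code in batchGenerator.py: the function `make_batches` and the way it cuts the data into contiguous stretches are assumptions made for illustration.

```python
import numpy as np

def make_batches(ids, batch_size, num_steps):
    """Yield (inputs, targets) pairs, each of shape (batch_size, num_steps)."""
    ids = np.asarray(ids)
    # each of the batch_size rows reads its own contiguous stretch of the data
    stretch_len = (len(ids) - 1) // batch_size
    inputs = ids[:batch_size * stretch_len].reshape(batch_size, stretch_len)
    # the targets are the same data, read one position further
    targets = ids[1:batch_size * stretch_len + 1].reshape(batch_size, stretch_len)
    for start in range(0, stretch_len - num_steps + 1, num_steps):
        yield inputs[:, start:start + num_steps], targets[:, start:start + num_steps]

# toy data: the 'word ids' are simply 0, 1, 2, ..., 20
toy_ids = list(range(21))
x, y = next(make_batches(toy_ids, batch_size=2, num_steps=3))
print(x)
print(y)
```

With these toy ids it is easy to check that every target is the id that directly follows the corresponding input id.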


Here is a function that pretty-prints a sequence from a mini-batch. You give it a batch as first argument, and the index of the sequence that you want to look at as second argument. In our case, there are 32 sequences in every mini-batch, so the indices range from 0 to 31 (in Python, indices always start at 0).

In [0]:
def print_batch(batch, idx):
  # print sequence number idx of the batch as words, with a line break after every sentence
  for i in range(num_steps):
    word = id_to_item[batch[idx][i]]
    if word == '<eos>':
      print()
    else:
      print(word, end=' ')
  print()
  print()

And here are some examples of what the first and fourth sequence of the  input and target batch look like. Try it yourself with some new values.

In [24]:
print_batch(input_batch, 0)
print_batch(target_batch, 0)

print_batch(input_batch, 3)
print_batch(target_batch, 3)

# try it yourself:
# print_batch(..., ...)

consumers may want to move their telephones a little closer to the tv set 
<unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk> 
two weeks ago viewers of several nbc 

may want to move their telephones a little closer to the tv set 
<unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk> 
two weeks ago viewers of several nbc <unk> 

says nbc has been able to charge premium rates for this ad time 
she would n't say what the premium is but it 's believed to be about N N above regular <unk> rates 
we were able to get advertisers to use their promotion budget for this because 

nbc has been able to charge premium rates for this ad time 
she would n't say what the premium is but it 's believed to be about N N above regular <unk> rates 
we were able to get advertisers to use their promotion budget for this because

In run_lm, there are two functions that can be used to train and/or test a model. 

**run_lm.run_lm()**: this function can be called to build, train and test models with different parameter settings.

**run_lm.run_epoch()**: this function does one pass over the whole dataset. If we are training the model, it updates the parameters and returns the perplexity. Otherwise, it just returns the perplexity.
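
The perplexity is the exponential of the average negative log probability per word (this definition is discussed in the section on the type of RNN cell). The sketch below shows the calculation on made-up word probabilities. The function `perplexity` is hypothetical and is not the implementation in run_lm.py.

```python
import numpy as np

def perplexity(word_probs):
    """Perplexity = exp(average negative log probability per word)."""
    log_probs = np.log(np.asarray(word_probs, dtype=np.float64))
    cross_entropy = -np.mean(log_probs)
    return np.exp(cross_entropy)

# a model that spreads its probability uniformly over a hypothetical vocabulary of 10,000 words
uniform = perplexity([1.0 / 10000] * 5)
# a model that assigns much higher (made-up) probabilities to the words that actually occur
confident = perplexity([0.5, 0.25, 0.5, 0.25])
print(uniform, confident)
```

A lower perplexity means that the model is, on average, less surprised by the words in the data.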

## Word embeddings

Often the input words for a language model are represented as indices in a vocabulary, or as one-hot vectors (where all values are 0 except at the index of the word, which has value 1). This is a discrete representation, just like in n-gram language models. It has the disadvantage that relationships between words (e.g. the syntactic relationship between 'eat' and 'eating', or the semantic relationship between 'eat' and 'drink') cannot be inferred from the word representations.

Neural language models, however, do not use this representation as is, but first map it to a continuous, lower-dimensional vector, also called a *word embedding*. They do this by looking up the index of the word in a weight matrix $\mathbf{W}$, which is often called the embedding matrix. By training the embedding matrix jointly with the rest of the language model, the resulting word embeddings will have some interesting properties: several syntactic and semantic relationships are encoded as vector offsets in the embedding space. A famous example is the vector offset for male - female, which is shown in the example below:

![Vector offset between male and female word embeddings](https://github.com/lverwimp/RNN_language_modeling/blob/master/kingqueen.png?raw=1)
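
The lookup in the embedding matrix can be sketched in a few lines of NumPy. The sizes and the random matrix below are toy values chosen for illustration and are not taken from the trained model. The sketch shows that multiplying a one-hot vector with $\mathbf{W}$ gives the same result as selecting the corresponding row of $\mathbf{W}$.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_vocab_size, toy_embedding_size = 6, 4
# toy embedding matrix: one row per word in the vocabulary
W = rng.normal(size=(toy_vocab_size, toy_embedding_size))

word_id = 3
one_hot = np.zeros(toy_vocab_size)
one_hot[word_id] = 1.0

via_matmul = one_hot.dot(W)   # one-hot vector times embedding matrix
via_lookup = W[word_id]       # direct lookup of the row
print(np.allclose(via_matmul, via_lookup))
```

In practice the lookup is used, since it avoids building the large and mostly empty one-hot vectors.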

Let's now train a language model and return the embedding matrix of the trained model:

In [0]:
emb_matrix = run_lm(cell='LSTM', 
                    optimizer='Adam', 
                    lr=0.01, 
                    inspect_emb=True, 
                    train_ids=train_ids, 
                    valid_ids=valid_ids, 
                    test_ids=test_ids)

INFO:tensorflow:Restoring parameters from models/rnn.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Saving checkpoint to path models/model.ckpt
INFO:tensorflow:Starting queue runners.
Epoch 1
Train perplexity: 205.207137658
Validation perplexity: 719.099237175
Epoch 2
Train perplexity: 183.706281768
Validation perplexity: 754.872969176
Epoch 3
Train perplexity: 166.842466179
Validation perplexity: 758.458571841
Epoch 4
Train perplexity: 152.107296048
Validation perplexity: 790.319179867
Epoch 5
Train perplexity: 139.465680327
Validation perplexity: 839.929676526
('Saved the model to ', 'models/rnn.ckpt')
Test perplexity: 624.673774295


In [0]:
def find_closest_words(emb_matrix, word, top_n=10):
  # print the top_n words whose embedding has the largest cosine similarity w.r.t. word
  if word not in item_to_id:
    raise KeyError('{0} is not in the vocabulary'.format(word))

  id_w = item_to_id[word]

  # normalize every embedding to unit length: the dot product then equals the cosine similarity
  norms = np.linalg.norm(emb_matrix, axis=1, keepdims=True)
  normalized = emb_matrix / np.maximum(norms, 1e-12)
  cos_sims = np.dot(normalized, normalized[id_w])

  # ignore the word itself
  cos_sims[id_w] = -np.inf

  # indices of the top_n largest cosine similarities, in decreasing order
  closest = np.argsort(-cos_sims)[:top_n]

  print('Words with largest cosine similarity w.r.t. {0}'.format(word))
  print('Word\t\tCosine similarity')
  for idx in closest:
    print('{0}\t\t{1}'.format(id_to_item[idx], cos_sims[idx]))

In [0]:
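# emb_matrix_large is returned by the run_lm(..., large_data=True) cell in the section 'Training networks' below: run that cell first
np.save('emb_matrix_large.npy', emb_matrix_large)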

find_closest_words(emb_matrix_large, 'test')
find_closest_words(emb_matrix_large, 'cat')

In [0]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(emb_matrix_large)
print(principalComponents)
print(principalComponents[:,0].shape)

colors = ['navy', 'turquoise', 'darkorange', 'red', 'black', 'blue','yellow','green']
target_names = ['cat', 'dog', 'elephant', 'tiger', 'mouse', 'driving','walking','flying']

for color, target_name in zip(colors, target_names):
    plt.scatter(principalComponents[item_to_id[target_name], 0], 
                principalComponents[item_to_id[target_name], 1], 
                color=color, 
                label=target_name)
plt.legend()

## Training networks

Training neural networks requires a lot of hyperparameter tuning. The hyperparameters of a neural network are, for example, the type of cell, its size, the method that is used for updating its parameters (also called the 'optimizer') and the type and strength of regularization. All these hyperparameters have to be chosen before the network can be built, trained and tested, and they all influence the performance of the model to some extent.

Recurrent neural networks are neural networks that take as input a combination of the standard input and the hidden state of the previous time step. Let's first train a simple recurrent neural network (RNN) as a language model. 
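
As a rough illustration of this recurrence, here is one time step of a simple RNN language model in NumPy. The parameter names $\mathbf{W}$, $\mathbf{U}$, $\mathbf{V}$, $\mathbf{b}$ and $\mathbf{b_v}$ are the ones used later in this notebook, but which matrix plays which role, the tanh activation and the toy sizes are assumptions of this sketch and are not taken from rnn_lm.py.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores into a probability distribution."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

def rnn_step(x_t, h_prev, W, U, V, b, b_v):
    """One time step of a simple RNN language model (assumed formulation)."""
    # new hidden state: combine the current input with the previous hidden state
    h_t = np.tanh(W.dot(x_t) + U.dot(h_prev) + b)
    # probability distribution over the next word
    y_t = softmax(V.dot(h_t) + b_v)
    return h_t, y_t

rng = np.random.RandomState(0)
toy_vocab, toy_emb, toy_hidden = 10, 4, 5   # toy sizes
W = rng.normal(scale=0.1, size=(toy_hidden, toy_emb))
U = rng.normal(scale=0.1, size=(toy_hidden, toy_hidden))
V = rng.normal(scale=0.1, size=(toy_vocab, toy_hidden))
b, b_v = np.zeros(toy_hidden), np.zeros(toy_vocab)

h = np.zeros(toy_hidden)
for x in rng.normal(size=(3, toy_emb)):   # three random toy word embeddings
    h, next_word_probs = rnn_step(x, h, W, U, V, b, b_v)
print(next_word_probs.shape, next_word_probs.sum())
```

The hidden state `h` is carried from one time step to the next, which is what allows the network to take the preceding words into account.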

### Optimizer

We start with the default optimizer and learning rate:

In [0]:
run_lm(cell='RNN')

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 28579.2624974
valid_ppl: 7976.3365824
Epoch 2
train_ppl: 10458.1135736
valid_ppl: 1364.88939657
Epoch 3
train_ppl: 1536.37432389
valid_ppl: 2130.04265576
Epoch 4
train_ppl: 870.760132806
valid_ppl: 709.111377606
Epoch 5
train_ppl: 713.030860672
valid_ppl: 724.993998427
test_ppl: 689.391726604


You see that both the training perplexity and the validation perplexity decreased during training, which is a good sign. However, notice that the validation perplexity of epoch 5 is slightly higher than the validation perplexity of epoch 4.

Let's now train the same network with the Adam optimizer instead of the default one:


In [0]:
run_lm(cell='RNN', optimizer='Adam')

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 8.3480932363e+189
valid_ppl: 8.94273109753e+240
Epoch 2
train_ppl: 8.90212615289e+269
valid_ppl: 2.70133700595e+271
Epoch 3




train_ppl: inf
valid_ppl: inf
Epoch 4
train_ppl: inf
valid_ppl: inf
Epoch 5
train_ppl: inf
valid_ppl: inf
test_ppl: inf


In [0]:
emb_matrix_large = run_lm(cell='LSTM', optimizer='Adam', lr=0.01, inspect_emb=True, large_data=True)

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
Train perplexity: 297.18710552
Validation perplexity: 234.21438526
Epoch 2
Train perplexity: 204.754365857
Validation perplexity: 213.141694848
Epoch 3
Train perplexity: 184.758054706
Validation perplexity: 204.33167339
Epoch 4
Train perplexity: 174.919933556
Validation perplexity: 200.768920798
Epoch 5
Train perplexity: 168.839457556
Validation perplexity: 199.088644593
('Saved the model to ', 'models/rnn.ckpt')
Test perplexity: 186.386608675


### Learning rate

Judging from the perplexities of the RNN trained with Adam above, it seems like the Adam optimizer is a bad choice for training our network! However, the interplay between the different hyperparameters of a neural network is complicated, and it is quite possible that a specific optimizer needs a different learning rate.
Let's try a learning rate of 0.01 instead of 1:

In [0]:
run_lm(cell='RNN', optimizer='Adam', lr=0.01)

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 1057.75617242
valid_ppl: 958.832506035
Epoch 2
train_ppl: 564.324279113
valid_ppl: 826.433342462
Epoch 3
train_ppl: 433.956799222
valid_ppl: 793.489183177
Epoch 4
train_ppl: 364.952622399
valid_ppl: 723.403317321
Epoch 5
train_ppl: 324.387741371
valid_ppl: 699.624224383
test_ppl: 532.177735851


This time, the network is converging nicely. Maybe reducing the learning rate even further helps? Let's try a learning rate of 0.001:

In [0]:
run_lm(cell='RNN', optimizer='Adam', lr=0.001)

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 647.90650982
valid_ppl: 490.249928851
Epoch 2
train_ppl: 378.543344185
valid_ppl: 413.048300797
Epoch 3
train_ppl: 300.501997416
valid_ppl: 375.886351108
Epoch 4
train_ppl: 258.467386044
valid_ppl: 367.867747945
Epoch 5
train_ppl: 232.486608834
valid_ppl: 356.596671902
test_ppl: 286.110701589


We see an additional improvement. Let's see what reducing the learning rate even further, to 0.0001, gives:

In [0]:
run_lm(cell='RNN', optimizer='Adam', lr=0.0001)

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 1366.65057312
valid_ppl: 677.639747264
Epoch 2
train_ppl: 630.575866312
valid_ppl: 690.140848653
Epoch 3
train_ppl: 625.483174618
valid_ppl: 707.694743583
Epoch 4
train_ppl: 620.683702468
valid_ppl: 1629.80400103
Epoch 5
train_ppl: 668.006540818
valid_ppl: 1913.05069198
test_ppl: 1789.96267687


Here we see an interesting result: the training perplexity decreased between epoch 1 and 4, but the validation perplexity is continuously increasing. In epoch 5, even the training perplexity increased again. This is an example of a learning rate that is too small: the steps that the network is making are too small.
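
The effect of the step size can be illustrated on a much simpler problem than a language model. The sketch below minimizes the toy function $f(w) = w^2$ with plain gradient descent (not Adam). The function `gradient_descent` and the three learning rates are hypothetical and only chosen to show the three regimes.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimize f(w) = w^2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w^2
        w = w - lr * grad    # step against the gradient
    return w

for lr in [1.1, 0.1, 0.0001]:
    print('lr={0}: final w = {1}'.format(lr, gradient_descent(lr)))
```

With the largest learning rate every step overshoots the minimum at 0 and the value moves further away, with the middle one the value gets close to 0, and with the smallest one it has barely moved after 20 steps.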

### Type of RNN cell

A simple RNN has some disadvantages: it often suffers from the so-called *vanishing and exploding gradients* problem. Neural networks are trained with an algorithm called backpropagation, which computes the gradients of the loss with respect to all parameters in the network. For a language model, the loss of the network is called the *cross entropy*, and it is equal to the average negative log probability of every word in the data. The perplexity of the language model is simply the exponential of the cross entropy.

In the case of a simple RNN, the parameters are the weight matrices $\mathbf{W}$, $\mathbf{U}$ and $\mathbf{V}$ and the bias vectors $\mathbf{b}$ and $\mathbf{b_v}$. The gradients of the loss with respect to the parameters $\mathbf{V}$ and $\mathbf{b_v}$ can be calculated directly, but the gradients with respect to the other parameters in the network are calculated with the chain rule, which results in a product of many terms. Moreover, an RNN is typically *unrolled in time*, which means that you also want to update the weights for the words seen before, making the product even longer. If the terms in the product are very small or very large, the product quickly gets even smaller (vanishes) or larger (explodes).

The exploding gradients problem can be solved relatively easily by clipping the (norm of the) gradients if they become too large. The vanishing gradients problem is (at least partially) solved by using another type of RNN cell, such as a long short-term memory (LSTM) cell.
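
Clipping the norm of the gradients can be sketched as follows. This is a generic NumPy version with a hypothetical function name and made-up gradients, and not necessarily how rnn_lm.py handles it: if the norm over all gradients together exceeds a threshold, all gradients are rescaled so that the norm equals the threshold.

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm):
    """Rescale a list of gradient arrays if their joint norm exceeds max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if global_norm <= max_norm:
        return gradients, global_norm
    # rescaling keeps the direction of the update, but limits its size
    scale = max_norm / global_norm
    return [g * scale for g in gradients], global_norm

# made-up gradients for two parameters
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Because all gradients are multiplied by the same factor, the relative sizes of the individual updates stay the same.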

An LSTM contains two hidden states instead of one, a cell state $\mathbf{c}_t$ and a hidden state $\mathbf{h}_t$, and three gates: the input gate, the forget gate and the output gate. The gates are shown in the upper part of the figure below: they have a sigmoid activation function, which makes sure that the output values are all between 0 and 1. The forget gate $\mathbf{f}_t$ is combined with the previous cell state $\mathbf{c}_{t-1}$: it thus decides which parts of the previous cell state should be forgotten (values close to 0) and which should be kept (values close to 1). A new cell state is then calculated by adding a combination of the input gate $\mathbf{i}_t$, which decides what should be added, and the candidate values $\mathbf{p}_t$, which are the result of a $\tanh$ non-linearity. The new cell state $\mathbf{c}_t$ is then put through another $\tanh$ and combined with the output gate, which decides which part of it should be let through to the new hidden state $\mathbf{h}_t$. The last part of the network is equal to that of the simple RNN. This [blog post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) gives a great description of how an LSTM cell works.

![LSTM cell](https://github.com/lverwimp/RNN_language_modeling/blob/master/LSTM.png?raw=1)
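
The description above can be turned into a minimal NumPy sketch of one LSTM time step. The parameter dictionary, its key names (`W_i`, `b_i`, ...) and the toy sizes are assumptions made for illustration. The actual cell in rnn_lm.py is provided by TensorFlow and may differ in its details.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and the new cell state."""
    z = np.concatenate([x_t, h_prev])                      # input and previous hidden state
    i_t = sigmoid(params['W_i'].dot(z) + params['b_i'])    # input gate
    f_t = sigmoid(params['W_f'].dot(z) + params['b_f'])    # forget gate
    o_t = sigmoid(params['W_o'].dot(z) + params['b_o'])    # output gate
    p_t = np.tanh(params['W_p'].dot(z) + params['b_p'])    # candidate values
    c_t = f_t * c_prev + i_t * p_t                         # forget old parts, add new parts
    h_t = o_t * np.tanh(c_t)                               # let part of the cell state through
    return h_t, c_t

rng = np.random.RandomState(0)
toy_input, toy_hidden = 4, 3   # toy sizes
params = {}
for gate in ['i', 'f', 'o', 'p']:
    params['W_' + gate] = rng.normal(scale=0.1, size=(toy_hidden, toy_input + toy_hidden))
    params['b_' + gate] = np.zeros(toy_hidden)

h, c = np.zeros(toy_hidden), np.zeros(toy_hidden)
for x in rng.normal(size=(5, toy_input)):   # five random toy inputs
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

Note that the cell state is only changed by an element-wise multiplication and an addition, which is what makes it easier for gradients to flow over many time steps.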

We take the best combination of optimizer (Adam) and learning rate (0.001) that we found for the RNN, and use it to train an LSTM:

In [0]:
run_lm(cell='LSTM', optimizer='Adam', lr=0.001)

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 656.82063114
valid_ppl: 529.864473663
Epoch 2
train_ppl: 424.844078727
valid_ppl: 464.263171504
Epoch 3
train_ppl: 356.302922604
valid_ppl: 419.632020537
Epoch 4
train_ppl: 312.305264384
valid_ppl: 397.430353409
Epoch 5
train_ppl: 279.950894164
valid_ppl: 376.370915052
test_ppl: 308.514639823


Surprise! The LSTM gives a worse test perplexity, 308.5, than the RNN with the same hyperparameters, 286.1. Looking at the evolution of the perplexities over epochs, we see that they only slowly decrease. Maybe we need a larger learning rate for an LSTM?

In [0]:
run_lm(cell='LSTM', optimizer='Adam', lr=0.01, inspect_emb=True)

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 481.709334113
valid_ppl: 396.625967937
Epoch 2
train_ppl: 286.597902762
valid_ppl: 355.204807228
Epoch 3
train_ppl: 235.102307065
valid_ppl: 341.794086337
Epoch 4
train_ppl: 207.242568642
valid_ppl: 341.275447747
Epoch 5
train_ppl: 188.696659696
valid_ppl: 349.23735006
test_ppl: 264.915329849



This perplexity is already much better. By further tuning the learning rate and/or the optimizer, we could probably get even lower perplexities.

### Size of the embedding

Let's now take a look at the influence of the size of the LSTM on its performance. By default, we train a model with embeddings of size 64 and a hidden layer of size 128. Let's see what happens if we reduce the size of the embedding:

In [0]:
run_lm(cell='LSTM', optimizer='Adam', lr=0.01, embedding_size=16)

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 519.359682418
valid_ppl: 5124521.52183
Epoch 2
train_ppl: 325.127520433
valid_ppl: 384.575596662
Epoch 3
train_ppl: 273.853564489
valid_ppl: 366.642785843
Epoch 4
train_ppl: 245.163913857
valid_ppl: 358.26870915
Epoch 5
train_ppl: 225.340377224
valid_ppl: 356.266744414
test_ppl: 288.184551098


### Size of the hidden layer

In [0]:
run_lm(cell='LSTM', optimizer='Adam', lr=0.01, hidden_size=64)

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 522.602473469
valid_ppl: 433.018669464
Epoch 2
train_ppl: 325.534562413
valid_ppl: 382.469223656
Epoch 3
train_ppl: 271.586685795
valid_ppl: 365.897337815
Epoch 4
train_ppl: 241.188401725
valid_ppl: 356.627466171
Epoch 5
train_ppl: 221.587265914
valid_ppl: 352.766427407
test_ppl: 279.318501236


### Regularization

In [0]:
run_lm(cell='LSTM', optimizer='Adam', lr=0.01, dropout_rate=0.1)

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Epoch 1
train_ppl: 809.953863498
valid_ppl: 658.909363669
Epoch 2
train_ppl: 584.828653476
valid_ppl: 606.689601129
Epoch 3
train_ppl: 539.409788271
valid_ppl: 585.649933925
Epoch 4
train_ppl: 519.498692319
valid_ppl: 574.898769121
Epoch 5
train_ppl: 508.607510944
valid_ppl: 574.182146616
test_ppl: 517.1558548


## Testing

Let's now test a trained network by calculating the log probability for a sentence. To do this, we first convert the sentence to indices and then run the model:

In [0]:
def get_log_prob(cell='LSTM', optimizer='SGD', lr=1, 
           embedding_size=64, hidden_size=128, 
           dropout_rate=0.5, train_ids=None,
           valid_ids=None, test_sent=None):
  
  # convert words to indices
  test_idx = []
  for w in test_sent.split(' '):
    if w not in item_to_id:
      raise IOError("{0} is not part of the vocabulary".format(w))
    else:
      test_idx.append(item_to_id[w])

  run_lm(cell=cell, 
         optimizer=optimizer, 
         lr=lr,
         embedding_size=embedding_size,
         hidden_size=hidden_size,
         dropout_rate=dropout_rate,
         inspect_emb=False, 
         train_ids=train_ids, 
         valid_ids=valid_ids, 
         test_ids=test_idx,
         test_log_prob=True)

      


To get the log probability of a specific sentence, use the following commands:

In [0]:
get_log_prob(test_sent='this is a test')
get_log_prob(test_sent='test a a a')

INFO:tensorflow:Restoring parameters from models/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:Saving checkpoint to path models/model.ckpt
Log probability: -15.9990825653
INFO:tensorflow:Restoring parameters from models/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Saving checkpoint to path models/model.ckpt
INFO:tensorflow:Starting queue runners.
Log probability: -24.7472014427


You should see that the log probability of 'test a a a' is lower than 'this is a test', which makes sense. You can test your own sentences here:

In [0]:
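# pass the sentence as keyword argument: the first positional argument of get_log_prob is cell
# get_log_prob(test_sent='your own test sentence')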
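
For reference, the log probability of a sentence is the sum of the log probabilities that the model assigns to each word given the preceding words. The sketch below uses made-up conditional probabilities instead of a trained model, and the function `sentence_log_prob` is hypothetical.

```python
import math

def sentence_log_prob(cond_probs):
    """Sum the log probabilities of the words of a sentence, each given its history."""
    return sum(math.log(p) for p in cond_probs)

# made-up probabilities for the four words of a plausible and an implausible sentence
likely = sentence_log_prob([0.1, 0.2, 0.3, 0.05])
unlikely = sentence_log_prob([0.01, 0.02, 0.02, 0.02])
print(likely, unlikely)
```

Working with sums of log probabilities instead of products of probabilities avoids numerical underflow for long sentences.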