
Imports:

In [0]:
import tensorflow as tf
import urllib, collections

Enable eager execution in TensorFlow:

In [0]:
tf.enable_eager_execution()

Get training, validation and test data:

In [0]:
train_url = 'http://homes.esat.kuleuven.be/~lverwimp/course_speech_recognition/train.txt'
valid_url = 'http://homes.esat.kuleuven.be/~lverwimp/course_speech_recognition/valid.txt'
test_url = 'http://homes.esat.kuleuven.be/~lverwimp/course_speech_recognition/test.txt'
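# in Python 3, urlopen lives in urllib.request and read() returns bytes, so decode to str
import urllib.request
train_file = urllib.request.urlopen(train_url).read().decode('utf-8')
valid_file = urllib.request.urlopen(valid_url).read().decode('utf-8')
test_file = urllib.request.urlopen(test_url).read().decode('utf-8')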

The data looks like this:

In [44]:
print('{0}...'.format(valid_file[:500]))

 consumers may want to move their telephones a little closer to the tv set 
 <unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk> 
 two weeks ago viewers of several nbc <unk> consumer segments started calling a N number for advice on various <unk> issues 
 and the new syndicated reality show hard copy records viewers ' opinions for possible airing on the next day 's show 
 interactive telephone technology...


<unk\> is the symbol for the unknown-word class, and 'N' is the symbol for the number class.
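
The files used here are already preprocessed, and how that was done is not shown in this notebook. Purely as an illustration, a minimal sketch of such a mapping could look like the following. The `normalize` helper, the toy vocabulary and the numeric pattern are all assumptions made for this example, not the procedure that produced the data.

```python
import re

# Hypothetical pattern for numeric tokens such as '25', '3.5' or '1,000'
NUMBER_PATTERN = re.compile(r'\d+([.,]\d+)*')

def normalize(tokens, vocab):
    """Map numeric tokens to 'N' and out-of-vocabulary tokens to '<unk>'."""
    out = []
    for tok in tokens:
        if NUMBER_PATTERN.fullmatch(tok):
            out.append('N')        # numbers class
        elif tok in vocab:
            out.append(tok)        # known word, kept as-is
        else:
            out.append('<unk>')    # unknown words class
    return out

# Toy sentence and vocabulary (made up for this sketch)
raw = 'the greatest play in 25 years on espn'.split()
vocab = {'the', 'greatest', 'play', 'in', 'years', 'on'}
normalized = normalize(raw, vocab)
print(normalized)
```

With a closed vocabulary like this, every token in the validation and test data is guaranteed to have an id once the training vocabulary is built.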
  
Convert data to correct format:

In [0]:
# convert the string to a list and replace newlines with the end-of-sentence symbol
train_text = train_file.replace('\n', ' <eos>').split(' ')
valid_text = valid_file.replace('\n', ' <eos>').split(' ')
test_text = test_file.replace('\n', ' <eos>').split(' ')

# count the frequencies of the words in the training data
counter = collections.Counter(train_text)

# sort according to decreasing frequency
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))

# items = tuple of all the words (in decreasing frequency)
items, _ = list(zip(*count_pairs))

# make a dictionary with a mapping from each word to an id; word with highest frequency gets lowest id etc.
item_to_id = dict(zip(items, range(len(items))))

# convert the words to indices
train_ids = [item_to_id[item] for item in train_text]
valid_ids = [item_to_id[item] for item in valid_text]
test_ids = [item_to_id[item] for item in test_text]
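
To make the steps above easier to follow, here is the same pipeline on a tiny made-up corpus, together with an inverse mapping that turns ids back into words. The names `toy_file`, `toy_text`, `toy_ids` and `id_to_item` are introduced only for this sketch and do not appear elsewhere in the notebook.

```python
import collections

# Toy corpus standing in for train_file (hypothetical data)
toy_file = 'the cat sat\n the dog sat\n the cat ran\n'

# Newlines become the end-of-sentence symbol, then split on spaces
toy_text = toy_file.replace('\n', ' <eos>').split(' ')

# Count word frequencies and sort by decreasing frequency (ties broken alphabetically)
counter = collections.Counter(toy_text)
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
items, _ = list(zip(*count_pairs))

# Most frequent word gets id 0, the next one id 1, and so on
item_to_id = dict(zip(items, range(len(items))))
toy_ids = [item_to_id[item] for item in toy_text]

# Inverse mapping, useful for inspecting model input and output
id_to_item = {idx: item for item, idx in item_to_id.items()}
decoded = [id_to_item[idx] for idx in toy_ids]

print(item_to_id)
print(toy_ids)
print(' '.join(decoded))
```

Because the ids are assigned by frequency rank, small ids correspond to very common tokens, which makes a printed id sequence somewhat readable at a glance.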

Once the data is converted to ids, it looks like this:

In [50]:
print(valid_ids[:100])

[2, 1133, 94, 359, 6, 330, 52, 9837, 7, 327, 2477, 6, 0, 663, 389, 2, 3, 1, 1, 2975, 2159, 10, 382, 1069, 2348, 90, 100, 848, 199, 1, 12, 0, 3384, 1120, 8, 4, 73, 21, 212, 347, 37, 259, 1, 1, 2, 3, 76, 423, 196, 3918, 5, 250, 1796, 1, 581, 3529, 893, 2375, 7, 4, 298, 12, 2710, 17, 1187, 1, 251, 2, 3, 9, 0, 36, 9923, 3748, 465, 711, 2999, 2038, 3918, 135, 6146, 12, 495, 5895, 17, 0, 131, 273, 10, 465, 2, 3, 9959, 733, 504, 31, 642, 7, 36, 6499]
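
A language model is trained to predict each next id from the ids that precede it, so the targets are simply the inputs shifted by one position. The sketch below shows one common way to cut an id sequence into such input/target segments. The `make_segments` helper and the `num_steps` parameter are assumptions for illustration and are not taken from this notebook.

```python
def make_segments(ids, num_steps):
    """Yield (inputs, targets) pairs of length num_steps.

    targets is inputs shifted one position to the right, so the model
    learns to predict the next id at every step.
    """
    for start in range(0, len(ids) - num_steps, num_steps):
        inputs = ids[start:start + num_steps]
        targets = ids[start + 1:start + num_steps + 1]
        yield inputs, targets

# First few ids from the output above
ids = [2, 1133, 94, 359, 6, 330, 52, 9837, 7]
segments = list(make_segments(ids, num_steps=4))
for inputs, targets in segments:
    print(inputs, '->', targets)
```

The range stops early enough that every segment has a full-length target, at the cost of dropping a few ids at the end of the sequence.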


Class for the language model:

In [0]:
class rnn_lm(object):
  '''
  This is a class to build and execute a recurrent neural network language model.
  '''
  
  def __init__(self,
              cell='LSTM',
              vocab_size=10000,
              embedding_size=64,
              hidden_size=128,
              dropout_rate=0.5):
    self.cell = cell
    self.vocab_size = vocab_size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size
    self.dropout_rate = dropout_rate
    
  def build_training_graph(self):
    
    self.embedding = tf.get_variable("embedding", [self.vocab_size, self.embedding_size], dtype=tf.float32)
    
    # LSTM cell with hidden_size units; forget_bias=1.0 is the BasicLSTMCell default
    self.cell = tf.contrib.rnn.BasicLSTMCell(self.hidden_size, forget_bias=1.0,
                                             state_is_tuple=True)