Character-level RNN Language Model written in TensorFlow.

The purpose of this notebook is to demonstrate how to build a recurrent neural network (RNN) for character-level language modeling and train it on George R.R. Martin's Game of Thrones novel series. We will show how the learned language model is able to generate sequences of characters that form story text in a style similar to that of the GoT novels.

In [1]:
# import all required libraries
import numpy as np
import tensorflow as tf

import collections

In [2]:
# Load the text corpus and return it as a list of character tokens.
# A '*STOP*' token is prepended, and every newline is replaced by '*STOP*'.
def read_file(path):
    with open(path) as f:
        char_tokens = ['*STOP*']
        text = f.read()
        char_tokens.extend(text)
        
        for i in range(len(char_tokens)):
            if char_tokens[i] == '\n':
                char_tokens[i] = '*STOP*'
        
        return char_tokens

In [3]:
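# Map each token to an integer id, ranked by frequency (most common token gets id 0).
# Returns the encoded corpus, the token->id dictionary and the id->token dictionary.
def build_dataset(tokens):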
    counts = []
    counts.extend(collections.Counter(tokens).most_common())
    
    dictionary = dict()
    data = list()
    
    for token, _ in counts:
        dictionary[token] = len(dictionary)
        
    for token in tokens:
        data.append(dictionary[token])
        
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    
    return data, dictionary, reverse_dictionary
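
To make the mapping concrete, here is a minimal sketch of the same frequency-ranked indexing applied to a toy string. The `toy_*` names and the input "banana" are illustrative only and are not part of the notebook; the steps mirror those in build_dataset.

```python
import collections

# Hypothetical toy input; the notebook uses the full corpus instead.
toy_tokens = list("banana")

# Rank tokens by frequency, most common first, as build_dataset does.
counts = collections.Counter(toy_tokens).most_common()

# Assign ids in rank order: the most frequent token gets id 0.
toy_dictionary = {}
for token, _ in counts:
    toy_dictionary[token] = len(toy_dictionary)

# Encode the sequence and build the inverse mapping for decoding.
toy_data = [toy_dictionary[t] for t in toy_tokens]
toy_reverse = dict(zip(toy_dictionary.values(), toy_dictionary.keys()))

decoded = ''.join(toy_reverse[i] for i in toy_data)
print(toy_dictionary)  # 'a' -> 0, 'n' -> 1, 'b' -> 2
print(toy_data)        # [2, 0, 1, 0, 1, 0]
print(decoded)         # banana
```

Decoding the ids with the reverse dictionary recovers the original string, which is the same round trip the notebook relies on later when printing batches.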

In [4]:
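# Build one batch of (context, target) rows starting at `offset`.
# Each context row holds num_steps consecutive token ids; its target row is the
# same window shifted one position ahead. The updated offset is returned so the
# caller can request the next batch from where this one stopped.
def generate_batch(dataset, batch_size, num_steps, offset=0):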
    assert offset + batch_size * num_steps < len(dataset)
    
    batch_context = np.ndarray((batch_size, num_steps), dtype=np.int32)
    batch_target = np.ndarray((batch_size, num_steps), dtype=np.int32)
    
    for i in range(batch_size):
        batch_context[i] = dataset[offset : offset+num_steps]
        batch_target[i] = dataset[offset+1 : offset+num_steps+1]
        offset += num_steps
        
    return batch_context, batch_target, offset
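
The one-step shift between context and target is easier to see on a tiny example. The sketch below repeats the function so that it runs on its own (using np.zeros instead of np.ndarray for the buffers, which does not change the result) and feeds it a hypothetical dataset of the ids 0..9; the `toy_data` name and sizes are assumptions made for illustration.

```python
import numpy as np

def generate_batch(dataset, batch_size, num_steps, offset=0):
    # Same logic as the notebook cell, repeated so this sketch is self-contained.
    assert offset + batch_size * num_steps < len(dataset)
    batch_context = np.zeros((batch_size, num_steps), dtype=np.int32)
    batch_target = np.zeros((batch_size, num_steps), dtype=np.int32)
    for i in range(batch_size):
        batch_context[i] = dataset[offset:offset + num_steps]
        batch_target[i] = dataset[offset + 1:offset + num_steps + 1]
        offset += num_steps
    return batch_context, batch_target, offset

toy_data = list(range(10))  # hypothetical token ids 0..9

context, target, next_offset = generate_batch(toy_data, batch_size=2, num_steps=3)
print(context)      # rows [0 1 2] and [3 4 5]
print(target)       # rows [1 2 3] and [4 5 6]
print(next_offset)  # 6

# Passing the returned offset back in continues where the last batch stopped.
context2, target2, next_offset2 = generate_batch(toy_data, 1, 3, offset=next_offset)
print(context2)     # row [6 7 8]
print(target2)      # row [7 8 9]
```

Because the assertion requires one extra token beyond the last context window, the final token of the corpus can only ever appear as a target, never as context.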

In [5]:
# define parameters of the program
corpus_path = '../data/got_all_edited.txt'

num_epoch = 30

batch_size = 5
num_steps = 60
embedding_size = 100

hidden_unit_size = 256
vocabulary_size = 20000  # placeholder; the actual size is derived from the corpus (see vocabsize)
learning_rate = 1e-4

In [10]:
tokens = read_file(corpus_path)
data, tokendict, tokendictreversed = build_dataset(tokens)

vocabsize = len(tokendict)

In [11]:
# Preview one batch: each context row is printed first, then its target row
# (the same text shifted one character ahead).
train, label, _ = generate_batch(data, batch_size, num_steps)
for batch_train, batch_label in zip(train, label):
    print(''.join([tokendictreversed[token] for token in batch_train]) + ' --> ')
    print(''.join([tokendictreversed[token] for token in batch_label]))
    print('----------')

*STOP*"We should start back," Gared urged as the woods began to g --> 
"We should start back," Gared urged as the woods began to gr
----------
row dark around them. "The wildlings are dead."*STOP*"Do the dead --> 
ow dark around them. "The wildlings are dead."*STOP*"Do the dead 
----------
 frighten you?" Ser Waymar Royce asked with just the hint of --> 
frighten you?" Ser Waymar Royce asked with just the hint of 
----------
 a smile.*STOP*Gared did not rise to the bait. He was an old man, --> 
a smile.*STOP*Gared did not rise to the bait. He was an old man, 
----------
 past fifty, and he had seen the lordlings come and go. "Dea --> 
past fifty, and he had seen the lordlings come and go. "Dead
----------


In [13]:
graph = tf.Graph()
with graph.as_default():
    pass
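
The graph in this cell is left as a placeholder. As a rough, framework-free illustration of what a character-level RNN computes at each step (embedding lookup, recurrent state update, softmax over the vocabulary), the NumPy sketch below may help. It is not the notebook's TensorFlow model: the weight names, the plain tanh cell, the random initialisation and the toy sizes are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy sizes (assumptions); the notebook sets embedding_size=100, hidden_unit_size=256.
vocab, embed, hidden = 5, 4, 8

# Randomly initialised parameters (hypothetical names).
E = rng.randn(vocab, embed) * 0.1       # embedding table, one row per character id
W_xh = rng.randn(embed, hidden) * 0.1   # input-to-hidden weights
W_hh = rng.randn(hidden, hidden) * 0.1  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(hidden)
W_hy = rng.randn(hidden, vocab) * 0.1   # hidden-to-output weights
b_y = np.zeros(vocab)

def rnn_step(token_id, h_prev):
    """Consume one character id; return the new state and next-char probabilities."""
    x = E[token_id]
    h = np.tanh(x.dot(W_xh) + h_prev.dot(W_hh) + b_h)
    logits = h.dot(W_hy) + b_y
    logits = logits - logits.max()      # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return h, probs

def sequence_loss(context, target):
    """Average cross-entropy of predicting target[t] after reading context[t]."""
    h = np.zeros(hidden)
    loss = 0.0
    for x_id, y_id in zip(context, target):
        h, probs = rnn_step(x_id, h)
        loss -= np.log(probs[y_id])
    return loss / len(context)

# One context/target pair in the same shifted layout that generate_batch produces.
loss = sequence_loss([0, 1, 2, 3], [1, 2, 3, 4])
h0, p0 = rnn_step(0, np.zeros(hidden))
print(loss)
print(p0.sum())
```

With untrained weights the predicted distribution is close to uniform, so the loss sits near log(vocab); training would push it down by adjusting the parameters with gradient descent.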

In [16]:
# with open(corpus_path) as f:
#     edited = []
#     lines = f.readlines()
#     for line in lines:
#         x = line.replace('`',"'").replace('”','"').replace("“", '"')
#         if not x == '\n':
#             if x[len(x)-2].isalpha() or x[len(x)-2] == ',':
#                 x = x.replace('\n', ' ')
#         edited.append(x)
        
# with open('../data/got_all_edited.txt', 'w') as f:
#     f.write(''.join(edited))