# Text Generation

This notebook generates text with a character-level LSTM.

I drew on several sources. First, the Google and Udacity ["Deep Learning" course](https://www.udacity.com/course/deep-learning--ud730) and Andrej Karpathy's [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). Second, [my own experience](https://github.com/ne3x7) from taking a Deep NLP course.

In [1]:
import numpy as np
import pandas as pd
import random as rnd
import seaborn as sns
import tensorflow as tf
import os

%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Load data

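I had fairly clean, concatenated Dostoevsky texts at hand, so I used that file as the source. Since we are building a character-level LSTM, the vocabulary is simply the set of distinct characters in the text. It is small enough to feed characters to the network as one-hot vectors, so no embedding layer is needed.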
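
As a minimal sketch of what a character-level vocabulary looks like, the toy example below builds the two mappings and a one-hot vector for a short string. The helper names `build_vocab` and `one_hot` and the sorted ordering are assumptions made for illustration; `build_corpus` in the next cell enumerates an unordered `set` instead.

```python
def build_vocab(text):
    """Map each distinct character to an integer id and back."""
    chars = sorted(set(text))  # sorted only to make the toy mapping reproducible
    char2id = {c: i for i, c in enumerate(chars)}
    id2char = {i: c for c, i in char2id.items()}
    return char2id, id2char


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


text = u"abracadabra"
char2id, id2char = build_vocab(text)

ids = [char2id[c] for c in text]                  # text as a list of ids
decoded = u"".join(id2char[i] for i in ids)       # and back to text
first = one_hot(ids[0], len(char2id))             # network input for 'a'
```

The round trip through `ids` recovers the original string, and each input vector has as many entries as there are distinct characters.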

In [2]:
def read_file(filename):
    """
    Reads source text file into memory.
    
    Args:
        filename (str): A file in ./data/ folder.
        
    Returns:
        raw: A string representation of source text.
    """
    assert os.path.exists('./data/%s' % filename), 'File %s not found in folder ./data' % filename
    with open('./data/%s' % filename, 'r') as fin:
        raw = fin.read()
    return raw.lower()

def build_corpus(data, valid_size_pct=0.1):
    """
    Builds character-index mapping, performs train/validation split, creates corpus as a list of indexes.
    
    Args:
        data (str): A string representation of source text.
        valid_size_pct (float): Fraction of data (between 0 and 1) to keep for validation.
        
    Returns:
        vocabulary_size (int): Vocabulary size.
        valid_size (int): Validation set size.
        char2id: A dict mapping characters to indexes.
        id2char: A dict mapping indexes to characters.
        train_corpus: An index representation of train text.
        valid_corpus: An index representation of validation text.
    """
    data = data.decode('utf-8')
    total_size = len(data)
    valid_size = int(valid_size_pct * total_size)
    train_size = total_size - valid_size
    valid_text = data[:valid_size]
    train_text = data[valid_size:]
    
    print 'Using validation size %d' % valid_size
    
    valid_corpus = []
    train_corpus = []
    char2id = {}
    id2char = {}
    
    vocabulary = set(data)
    vocabulary_size = len(vocabulary)
    
    print 'Using vocabulary size %d' % vocabulary_size
    
    for i, c in enumerate(vocabulary):
        char2id[c] = i
        id2char[i] = c
        
    for c in train_text:
        train_corpus.append(char2id[c])
        
    for c in valid_text:
        valid_corpus.append(char2id[c])
        
    return vocabulary_size, valid_size, char2id, id2char, train_corpus, valid_corpus

In [3]:
raw = read_file('dostoevskiy_only_text.txt')
print 'Data size: %d' % len(raw)
print 'Example: \'%s\'' % raw[:64]

vocabulary_size, valid_size, char2id, id2char, train_data, valid_data = build_corpus(raw, 0.01)

print 'Same example: \'%s\'' % valid_data[:64]
print 'Part of vocabulary: %s' % [id2char[i] for i in range(50)]

Data size: 11396681
Example: 'Ох уж эти мне сказочники! Нет чтобы '
Using validation size 63754
Using vocabulary size 139
Same example: '[73, 120, 48, 46, 84, 48, 122, 82, 87, 48, 15, 116, 14, 48, 117, 80, 123, 42, 81, 47, 116, 87, 80, 87, 108, 48, 106, 14, 82, 48, 47, 82, 81, 113, 49, 48, 116, 123, 44, 87, 117, 123, 82, 19, 48, 47, 82, 81, 0, 116, 87, 113, 46, 12, 19, 48, 44, 81, 43, 14, 42, 116, 81, 14]'
Part of vocabulary: [u'-', u'\u040c', u'\u0410', u'\u0414', u'\u0418', u'\u041c', u'\u0420', u'\u0424', u'\u0433', u'(', u',', u'0', u'\u0434', u'\xb7', u'\u0435', u'\u043c', u'\u0440', u'\u0444', u'\u0448', u'\u044c', u'9', u'b', u'2', u';', u'd', u'h', u'l', u'\xef', u'p', u'?', u'x', u'\u0301', u'\u0413', u'\u0417', u'\u041b', u'\u041f', u'\u0423', u'\u0427', u'\u042b', u'4', u'\u042f', u'3', u'\u0437', u'\u043b', u'\u043f', u't', u'\u0443', u'\u0447', u' ', u'\u044b']


## Character-level LSTM

Long Short-Term Memory (LSTM) networks extend vanilla recurrent neural networks with gates that control what the cell state remembers and what it forgets, which lets them keep track of longer-range dependencies in a sequence. This is very important for character-level RNNs, where even a single word spans many time steps.
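
To illustrate remembering and forgetting, here is a minimal sketch of one step of a textbook LSTM cell with scalar input and state. It is not the cell used in this notebook (`lstm_cell` further down applies an extra affine layer before the output `tanh`), and the weights are hypothetical hand-picked values rather than trained ones.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, params):
    """One step of a textbook LSTM cell with scalar input, hidden and cell state."""
    def unit(name, squash):
        # Every unit sees the current input x and the previous hidden state h.
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h + b)

    i = unit('input', sigmoid)        # how much new information to write
    f = unit('forget', sigmoid)       # how much of the old cell state to keep
    o = unit('output', sigmoid)       # how much of the cell state to expose
    g = unit('candidate', math.tanh)  # proposed new content
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


def run(params, steps=20):
    """Feed a constant input for `steps` steps and return the final cell state."""
    h, c = 0.0, 1.0                   # start with something stored in the cell
    for _ in range(steps):
        h, c = lstm_step(1.0, h, c, params)
    return c


# Hypothetical weights (w_x, w_h, b) per unit; a trained model learns these.
remember = {'input': (0.0, 0.0, -10.0), 'forget': (0.0, 0.0, 10.0),
            'output': (0.0, 0.0, 0.0), 'candidate': (1.0, 1.0, 0.0)}
forget = dict(remember, forget=(0.0, 0.0, -10.0))

c_remember = run(remember)   # forget gate near 1: the stored value survives
c_forget = run(forget)       # forget gate near 0: the stored value is erased
```

With the forget gate saturated near 1 and the input gate near 0, the cell state stays close to its initial value over 20 steps; flipping the forget bias erases it almost immediately.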

Here we feed the model one symbol at a time and ask it to predict the next one from its state and the input itself. The model output is passed through a softmax, so it can be treated as a properly normalized categorical probability distribution. During training, we use this distribution to compute the cross-entropy between the model outputs and the true next symbols. When generating text, we draw a symbol from this distribution and feed it back as the next input.
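
The three operations in this paragraph (softmax, cross-entropy against the true symbol, and sampling the next symbol) can be sketched in plain Python as follows. The logits are made-up numbers and the function names are illustrative assumptions, not the helpers used later in the notebook.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the true symbol."""
    return -math.log(probs[target])


def sample_index(probs, rng):
    """Draw an index according to the categorical distribution `probs`."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1                        # guard against rounding error


logits = [2.0, 1.0, 0.1]                         # hypothetical model scores
probs = softmax(logits)
loss = cross_entropy(probs, 0)                   # true next symbol has id 0

rng = random.Random(0)                           # fixed seed for repeatability
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_index(probs, rng)] += 1
```

Sampling, rather than always taking the most likely symbol, is what keeps generated text from collapsing into a repeating loop of the most frequent characters.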

During training, dropout with keep probability 0.5 is applied to the one-hot inputs and to the LSTM outputs as regularization. To measure the performance of the model, perplexity is computed between the predicted distributions and the true next characters.
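
Perplexity is the exponential of the mean negative log-probability that the model assigns to the true characters. The sketch below, with the hypothetical helper `perplexity_from_probs`, shows two useful reference points: a model that guesses uniformly over the 139 characters of this vocabulary has perplexity 139, and a perfect model has perplexity 1.

```python
import math


def perplexity_from_probs(probs_of_true):
    """exp of the mean negative log-probability given to the true characters."""
    nll = [-math.log(p) for p in probs_of_true]
    return math.exp(sum(nll) / len(nll))


vocabulary_size = 139                                            # from the output above
uniform = perplexity_from_probs([1.0 / vocabulary_size] * 10)    # uniform guessing
confident = perplexity_from_probs([0.9] * 10)                    # mostly right
mixed = perplexity_from_probs([0.5, 0.25])                       # geometric mean of 2 and 4
```

Since perplexity is just `exp` of the cross-entropy, an average loss of 2.80 nats corresponds to a perplexity of about 16.4, which can be read as the model hesitating between roughly 16 equally likely characters.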

In [4]:
batch_size = 128
trunc_by = 50

class BatchGenerator(object):
    """
    Batch generation assistant class.
    """
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()
  
    def _next_batch(self):
        """
        Generates a single batch from the current cursor position in the data.
        """
        # Integer ids: they are fed to tf.int32 placeholders and used as array indices.
        batch = np.zeros(shape=(self._batch_size), dtype=np.int32)
        for b in range(self._batch_size):
            batch[b] = self._text[self._cursor[b]]
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
  
    def next(self):
        """
        Generates the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches
    
def logprob(predictions, labels):
    """
    Computes mean cross-entropy between `predictions` and `labels`.
    
    Args:
        predictions (np.ndarray): Predicted probabilities of shape (n, vocabulary_size);
            values below 1e-10 are clipped in place to avoid log(0).
        labels (np.ndarray): Character ids of shape (n,).
        
    Returns:
        Cross-entropy averaged over the n predictions.
    """
    true = np.zeros_like(predictions, dtype=float)
    for i in range(len(labels)):
        true[i, labels[i]] = 1.0
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(true, -np.log(predictions))) / true.shape[0]

def perplexity(predictions, labels):
    """
    Computes perplexity between `predictions` and `labels`.
    
    Args:
        predictions (np.ndarray): Predicted probabilities of shape (n, vocabulary_size).
        labels (np.ndarray): Character ids of shape (n,).
        
    Returns:
        Perplexity, i.e. exp of the mean cross-entropy.
    """
    return np.exp(logprob(predictions, labels))

def sample(distribution):
    """
    Sample an index according to categorical `distribution`.
    
    Args:
        distribution (list): Probabilities that sum to 1.
        
    Returns:
        Sampled value.
    """
    r = rnd.uniform(0, 1)
    s = 0
    
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    
    return len(distribution) - 1

In [5]:
train_batches = BatchGenerator(train_data, batch_size, trunc_by)
valid_batches = BatchGenerator(valid_data, 1, 1)
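
To make the batching scheme concrete, here is a pure-Python sketch of the cursor logic from `BatchGenerator` on a toy corpus of 12 ids. The functions `make_cursors` and `next_batch` are simplified stand-ins written for illustration; unlike `BatchGenerator.next`, the sketch does not carry the last batch over between calls.

```python
def make_cursors(text_size, batch_size):
    """Spread batch_size read positions evenly over the text."""
    segment = text_size // batch_size
    return [offset * segment for offset in range(batch_size)]


def next_batch(text, cursors):
    """Read one id at every cursor, then advance each cursor by one (wrapping)."""
    batch = [text[c] for c in cursors]
    for b in range(len(cursors)):
        cursors[b] = (cursors[b] + 1) % len(text)
    return batch


text = list(range(12))                       # toy corpus: ids 0..11
cursors = make_cursors(len(text), 3)         # starts at positions 0, 4 and 8

# Three consecutive batches: row t holds the t-th id of each parallel stream.
unrolled = [next_batch(text, cursors) for _ in range(3)]

# Inputs are all batches but the last, labels are the same batches shifted by one.
inputs, labels = unrolled[:-1], unrolled[1:]
```

Each position in a batch follows its own contiguous stretch of text, so the saved LSTM state for that position stays meaningful from one training step to the next.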

In [None]:
graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data
    train_data = [tf.placeholder(tf.int32, shape=[batch_size]) for i in range(trunc_by+1)]
    train_inputs = train_data[:-1]
    train_labels = train_data[1:]

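    # Variables
    # Note: tf.truncated_normal takes (shape, mean, stddev), so the positional
    # arguments -0.1, 0.1 below set mean=-0.1 and stddev=0.1, not a [-0.1, 0.1] range.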
    # Input gate
    input_weights = tf.Variable(tf.truncated_normal([2 * vocabulary_size, vocabulary_size], -0.1, 0.1),
                                name='input_weights')
    input_biases = tf.Variable(tf.truncated_normal([1, vocabulary_size], -0.1, 0.1),
                               name='input_biases')
    # Forget gate
    forget_weights = tf.Variable(tf.truncated_normal([2 * vocabulary_size, vocabulary_size], -0.1, 0.1),
                                 name='forget_weights')
    forget_biases = tf.Variable(tf.truncated_normal([1, vocabulary_size], -0.1, 0.1),
                                name='forget_biases')
    # State cell
    cell_weights = tf.Variable(tf.truncated_normal([2 * vocabulary_size, vocabulary_size], -0.1, 0.1),
                               name='cell_weights')
    cell_biases = tf.Variable(tf.truncated_normal([1, vocabulary_size], -0.1, 0.1),
                              name='cell_biases')
    # Hidden state updates
    hidden_weights = tf.Variable(tf.truncated_normal([vocabulary_size, vocabulary_size], -0.1, 0.1),
                                 name='hidden_weights')
    hidden_biases = tf.Variable(tf.truncated_normal([1, vocabulary_size], -0.1, 0.1),
                                name='hidden_biases')
    # Output gate
    output_weights = tf.Variable(tf.truncated_normal([2 * vocabulary_size, vocabulary_size], -0.1, 0.1),
                                 name='output_weights')
    output_biases = tf.Variable(tf.truncated_normal([1, vocabulary_size], -0.1, 0.1),
                                name='output_biases')
    # Connections
    saved_state = tf.Variable(tf.truncated_normal([batch_size, vocabulary_size], -0.1, 0.1),
                              name='saved_state')
    saved_h = tf.Variable(tf.truncated_normal([batch_size, vocabulary_size], -0.1, 0.1),
                          name='saved_h')
    # Softmax
    softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, vocabulary_size]),
                                  name='softmax_weights')
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]),
                                 name='softmax_biases')
    
    # Cell
    def lstm_cell(w, h, state):
        i = tf.concat([w, h], 1)
        input_gate = tf.sigmoid(tf.matmul(i, input_weights) + input_biases)
        forget_gate = tf.sigmoid(tf.matmul(i, forget_weights) + forget_biases)
        output_gate = tf.sigmoid(tf.matmul(i, output_weights) + output_biases)
        state_update = tf.tanh(tf.matmul(i, cell_weights) + cell_biases)
        state = forget_gate * state + input_gate * state_update
        state_update = tf.tanh(tf.matmul(state, hidden_weights) + hidden_biases)
        h = output_gate * state_update
        return h, state

    # Model
    hs = []
    h = saved_h
    state = saved_state
    for w in train_inputs:
        one_hot_w = tf.one_hot(w, vocabulary_size)
        sparse_w = tf.nn.dropout(one_hot_w, keep_prob=0.5)
        h, state = lstm_cell(sparse_w, h, state)
        hs.append(h)
        
    with tf.control_dependencies([saved_h.assign(h), saved_state.assign(state)]):
        sparse_h = tf.nn.dropout(tf.concat(hs, 0), keep_prob=0.5)
        logits = tf.nn.xw_plus_b(sparse_h, softmax_weights, softmax_biases)
        true = tf.one_hot(indices=tf.concat(train_labels, 0), depth=vocabulary_size)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=true, logits=logits))
    
    # Optimizer
    gs = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(1e-1, gs, 4000, 0.1, staircase=True)
    optimizer = tf.train.RMSPropOptimizer(learning_rate)
    
    # Gradients
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=gs)

    # Prediction
    train_prediction = tf.nn.softmax(tf.nn.xw_plus_b(tf.concat(hs, 0), softmax_weights, softmax_biases))
    
    # Evaluation
    sample_input = tf.placeholder(tf.int32, [1])
    saved_sample_h = tf.Variable(tf.zeros([1, vocabulary_size]), name='saved_sample_h')
    saved_sample_state = tf.Variable(tf.zeros([1, vocabulary_size]), name='saved_sample_state')
    
    reset_sample_state = tf.group(saved_sample_h.assign(tf.zeros([1, vocabulary_size])),
                                  saved_sample_state.assign(tf.zeros([1, vocabulary_size])))
    
    tf.add_to_collection('ops', reset_sample_state)
    
    one_hot_sample_input = tf.one_hot(sample_input, vocabulary_size)
    sample_h, sample_state = lstm_cell(one_hot_sample_input, saved_sample_h, saved_sample_state)
    with tf.control_dependencies([saved_sample_h.assign(sample_h), saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_h, softmax_weights, softmax_biases))
        tf.add_to_collection('ops', sample_prediction)
        
    # Saving
    to_save = {'input_weights': input_weights, 'input_biases': input_biases,
               'forget_weights': forget_weights, 'forget_biases': forget_biases,
               'cell_weights': cell_weights, 'cell_biases': cell_biases,
               'hidden_weights': hidden_weights, 'hidden_biases': hidden_biases,
               'output_weights': output_weights, 'output_biases': output_biases,
               'saved_state': saved_state, 'saved_h': saved_h,
               'softmax_weights': softmax_weights, 'softmax_biases': softmax_biases,
               'saved_sample_h': saved_sample_h, 'saved_sample_state': saved_sample_state}
    
    saver = tf.train.Saver(to_save)
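
The optimizer in the cell above combines a staircase exponential learning-rate decay with gradient clipping by global norm. The pure-Python sketch below reproduces that arithmetic as I understand it from the TensorFlow documentation; `staircase_decay` and `clip_by_global_norm_sketch` are hypothetical stand-ins, not TensorFlow functions.

```python
import math


def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """Learning rate that is multiplied by decay_rate every decay_steps steps."""
    return initial_rate * decay_rate ** (step // decay_steps)


def clip_by_global_norm_sketch(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = clip_norm / max(global_norm, clip_norm)   # 1.0 when already small enough
    return [[g * scale for g in vec] for vec in grads], global_norm


# Same settings as the graph above: start at 0.1, divide by 10 every 4000 steps.
rates = [staircase_decay(0.1, s, 4000, 0.1) for s in (0, 3999, 4000, 8000)]

# Two toy gradient "tensors" with joint norm 5.0, clipped to 1.25.
clipped, norm = clip_by_global_norm_sketch([[3.0], [4.0]], 1.25)

# Gradients whose joint norm is already below the threshold are left unchanged.
small, small_norm = clip_by_global_norm_sketch([[0.3], [0.4]], 1.25)
```

Clipping by the global norm keeps the direction of the update intact while bounding its size, which helps against the exploding gradients that recurrent networks are prone to.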

In [None]:
num_steps = 50001
print_every = 500
losses = []

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    
    average_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        labels = np.concatenate(batches[1:])
        feed_dict = {}
        for i in range(trunc_by + 1):
            feed_dict[train_data[i]] = batches[i]

        opt, l, pred, lr = session.run([optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        average_loss += l
        if step % print_every == 0:
            if step > 0:
                average_loss = average_loss / print_every
            losses.append(average_loss)
            print 'Average loss at step %d: %.2f learning rate %e' % (step, average_loss, lr)
            average_loss = 0
            labels = np.concatenate(batches[1:])
            print 'Minibatch perplexity: %e' % perplexity(pred, labels)
                  
            if (step % (print_every * 5)) == 0:
                print('=' * 80)
                print 'Generation examples:'
                print('=' * 80)
                for _ in range(5):
                    s = rnd.randint(0, vocabulary_size-1)
                    sentence = id2char[s]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: [s]})
                        s = sample(prediction[0])  # prediction has shape (1, vocabulary_size)
                        sentence += id2char[s]
                    print sentence
                print('=' * 80)
                
                saver.save(session, './model/model')
            
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print 'Validation set perplexity: %e\n' % np.exp(valid_logprob / valid_size)
            if len(losses) > 2 and abs(losses[-2] - losses[-1]) < 1e-5:
                print 'Converged'
                break

Initialized
Average loss at step 0: 37.04 learning rate 1.000000e-01
Minibatch perplexity: 6.092779e+07
Generation examples:
]6у(}lЗМ!aЯМ:Ё(ЁЩЙslтl9((=®ф·l&ЁЫМЕЁфЁm,ЦаБôб('ЁoЁы(ЁМиМ[ф/Ёу6Ь)/иcМЪ(́фoЁ@МфМ9l




Й=М(Е(яЁbЁр(8М"Ж=y^=~М·Ёлb((эЖ·Й@ЁЕщФЁиЁя)pМр6уa[Ж*lЖ6~'р(wЖмЖДéЗщШф4Ж;]sиtl9Мё!
x=жЖelВижМлЖxфб(fcdМт)лМОЁьЁЛЁ,Ж(М7ЁeМcМЗЖАМp)юlèlШ(kЁХЁ&Ё{Мh6alШé́édфБМЧаР)АЖзЁ
еbЉиf6Офу([)~ЖЮЁЉЖ́6=Ё1фгЁ́(ТЖРl0(хЖw6wаéМ7(gМЧфйЁnЁцЁмМrЖфЁДЁâМ>Ж;иНхyaвЁ~МmЖ?Ж
,)ДиЭхЬ)l(uМ?фКМ(ЙюЁ)6jаxф4cИЁjЖцl66_l{иПМtа@И]6cЁêЁbхk6ьlзМШМрИ0ЖНМ4М!Жмщp(wЁt6
Validation set perplexity: 1.795015e+03

Average loss at step 500: 9.59 learning rate 1.000000e-01
Minibatch perplexity: 1.541909e+01
Validation set perplexity: 1.316455e+01

Average loss at step 1000: 2.81 learning rate 1.000000e-01
Minibatch perplexity: 1.537347e+01
Validation set perplexity: 1.519365e+01

Average loss at step 1500: 2.80 learning rate 1.000000e-01
Minibatch perplexity: 1.520771e+01
Validation set perplexity: 1.593898e+01

Average loss at step 2000: 2.80 learning rate 1.000000e-01
Minibatch perplexity: 1.445658e+01
Validation set perplexity: 1.620861e+01

Average loss at step 2500: 2.80 learning rate 1.000000e-01
Minibatch perplexity: 1.548256e+01
Ge

In [None]:
plt.figure(figsize=(20, 10))
plt.semilogy(np.arange(len(losses)) * print_every, losses, label='Loss')
sns.
plt.xlabel('Training step', fontsize='xx-large')
plt.ylabel('Loss (cross-entropy)', fontsize='xx-large')
plt.legend(fontsize='xx-large')

I am training on a CPU and time is nearly up, so it is hard to train a good model. Still, here are some generated examples:

In [9]:
with tf.Session(graph=graph) as session:
    loader = tf.train.import_meta_graph('./model/model.meta')
    loader.restore(session, './model/model')
    
    ops = tf.get_collection('ops')
    reset_sample_state = ops[0]
    sample_prediction = ops[1]
    
    for _ in range(5):
        s = rnd.randint(0, vocabulary_size-1)
        sentence = id2char[s]
        reset_sample_state.run()
        for _ in range(79):
            prediction = sample_prediction.eval({sample_input: [s]})
            s = sample(prediction[0])  # prediction has shape (1, vocabulary_size)
            sentence += id2char[s]
        print '-->', sentence

aôепрештлы, о И нимосмлтсне ч, даватокув каз влек о жнешпыновэтиг бовал позом.пу
Мôо уххонщть м,житаянодн, о иех измсн, тот, новстуе ии! с Ь)- Сшотужат, по озанк
Ьôоо, не итиря ео мызявис ннира я былетое дыавнизреововннны ь яывивнх, ослииуий 
иЫ лертмал, три парзмоги, ет спусит м в гоеел у роеовлиеуя и ярмакала с, слошона
iЌерапрелч. 1 т-поо.стня н жизьл чеосмоео..... вбио ажало! этосе зчму сагосает с
