Based on:
https://github.com/ml4a/ml4a-guides/blob/master/notebooks/recurrent_neural_networks.ipynb

## Recurrent Neural Networks: Character RNNs with Keras

Often we are not interested in isolated datapoints, but rather datapoints within a context of others. A datapoint may mean something different depending on what's come before it. This can typically be represented as some kind of _sequence_ of datapoints, perhaps the most common of which is a time series.

One of the most ubiquitous sequences of data where context is especially important is natural language. We have quite a few words in English where the meaning of a word may be totally different depending on its context. An innocuous example of this is "bank": "I went fishing down by the river bank" vs "I deposited some money into the bank".

If we consider that each word is a datapoint, most non-recurrent methods will treat "bank" in the first sentence exactly the same as "bank" in the second sentence - they are indistinguishable. If you think about it, in isolation they are indistinguishable to us as well - it's the same word!

We can only start to discern them when we consider the previous word (or words). So we might want our neural network to consider that "bank" in the first sentence is preceded by "river" and that in the second sentence "money" comes a few words before it. That's basically what RNNs do - they "remember" some of the previous context, and that memory influences the output they produce. This "memory" (called the network's "_hidden state_") works by retaining some of the previous outputs and combining them with the current input; this recurrence (the feedback of the network's output back into itself) is where the name comes from.
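To make the hidden state concrete, here is a minimal NumPy sketch of a single vanilla RNN cell. The sizes, the random weights, and the names `rnn_step` and `run_rnn` are illustrative assumptions, not part of the notebook's Keras model. It shows that the same final input yields a different hidden state depending on what came before it.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 4, 3   # illustrative sizes, not taken from the notebook

# Hypothetical weights for a single vanilla RNN cell
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden (the feedback loop)
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def run_rnn(inputs):
    """Run the cell over a whole sequence and return the final hidden state."""
    h = np.zeros(hidden_size)
    for x_t in inputs:
        h = rnn_step(x_t, h)
    return h

# Two sequences that end with the same input but differ in what came before
same_last = np.array([0.0, 0.0, 1.0, 0.0])
seq_a = [np.array([1.0, 0.0, 0.0, 0.0]), same_last]
seq_b = [np.array([0.0, 1.0, 0.0, 0.0]), same_last]

h_a = run_rnn(seq_a)
h_b = run_rnn(seq_b)
print(h_a)
print(h_b)
```

The two printed vectors differ even though the last input is identical, which is the "bank after river" versus "bank after money" effect in miniature.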

This recurrence makes RNNs quite deep once they are unrolled over many time steps, and thus they can be difficult to train. The gradient gets smaller and smaller the further it is pushed backwards through the network until it "vanishes" (effectively becomes zero), so long-term dependencies are hard to learn. The typical practice is to only extend the RNN back a certain number of time steps so the network is still trainable.
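The shrinking gradient can be seen in a toy setting. The sketch below assumes a scalar RNN `h_t = tanh(w * h_{t-1} + x_t)` with a made-up weight `w = 0.9` and random inputs; it is an illustration, not the network trained later. By the chain rule each step multiplies the gradient by `w * (1 - h_t**2)`, which is at most 0.9 here, so the product decays at least geometrically.

```python
import numpy as np

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad, history = h0, 1.0, []
    for x_t in inputs:
        h = np.tanh(w * h + x_t)
        # chain rule: d h_t / d h_{t-1} = w * (1 - h_t**2)
        grad *= w * (1.0 - h ** 2)
        history.append(abs(grad))
    return history

rng = np.random.default_rng(0)
inputs = rng.normal(size=50)   # 50 time steps of made-up input
history = gradient_through_time(w=0.9, inputs=inputs)

for step in (1, 10, 25, 50):
    print(step, history[step - 1])
```

After 50 steps the gradient with respect to the initial state is bounded by `0.9 ** 50`, roughly 0.005, which is why influence from far back in the sequence is hard to learn.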

Certain units, such as the LSTM (long short-term memory) and GRU (gated recurrent unit), have been developed to mitigate some of this vanishing gradient effect.

Let's walk through an example of a character RNN, which is a great approach for learning a character-level language model. A language model is essentially some function which returns a probability distribution over possible words (or in this case, characters), based on what has been seen so far. This function can vary from region to region (e.g. if terms like "pop" are used more commonly than "soda") or from person to person. You could say that a (good) language model captures the style in which someone writes.

Language models often must make the simplifying assumption that only what came immediately (one time step) before matters (this is called the "Markov assumption"), but with RNNs we do not need to make such an assumption.
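For contrast, here is a minimal sketch of a character-level model that does make the Markov assumption: it predicts the next character from the single previous character only. The toy text and the helper name `train_bigram_model` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Estimate P(next_char | previous_char) from raw counts."""
    counts = defaultdict(Counter)
    for prev_char, next_char in zip(text, text[1:]):
        counts[prev_char][next_char] += 1
    model = {}
    for prev_char, next_counts in counts.items():
        total = sum(next_counts.values())
        model[prev_char] = {ch: n / total for ch, n in next_counts.items()}
    return model

toy_text = "the bank by the river bank"   # made-up text for illustration
bigram = train_bigram_model(toy_text)

# Distribution over what follows "b", regardless of anything earlier in the text
print(bigram["b"])
```

Such a model cannot tell the two kinds of "bank" apart, because everything before the previous character is discarded. An RNN's hidden state is what lets it carry that earlier context forward.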

We'll use Keras, which makes building neural networks extremely easy (this example is an annotated version of Keras's [LSTM text generation example](https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py)).

First we'll do some simple preparation - import the classes we need and load up the text we want to learn from.

In [1]:
import os

#if using Theano with GPU
#os.environ["THEANO_FLAGS"] = "mode=FAST_RUN,device=gpu,floatX=float32"

import random
import numpy as np
from glob import glob
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation, Dropout
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from tqdm import tqdm

import pandas as pd

Using TensorFlow backend.


In [2]:
%matplotlib inline
from matplotlib import pyplot as plt

In [3]:
# load up our text
all_jokes = pd.read_csv('shortjokes.csv')
all_jokes.head()

Unnamed: 0,ID,Joke
0,1,"[me narrating a documentary about narrators] ""..."
1,2,Telling my daughter garlic is good for you. Go...
2,3,I've been going through a really rough period ...
3,4,"If I could have dinner with anyone, dead or al..."
4,5,Two guys walk into a bar. The third guy ducks.


In [4]:
# let us create a long string variable text
text = '\n'.join(all_jokes.Joke)
text[0:500]

'[me narrating a documentary about narrators] "I can\'t hear what they\'re saying cuz I\'m talking"\nTelling my daughter garlic is good for you. Good immune system and keeps pests away.Ticks, mosquitos, vampires... men.\nI\'ve been going through a really rough period at work this week It\'s my own fault for swapping my tampax for sand paper.\nIf I could have dinner with anyone, dead or alive... ...I would choose alive. -B.J. Novak-\nTwo guys walk into a bar. The third guy ducks.\nWhy can\'t Barbie get pregn'

In [5]:
# extract all (unique) characters
# these are our "categories" or "labels". We want to predict the next character from the past few (e.g. 20) characters

def removeChars(text):
  # drop control characters, plus '~' and '^', which are reserved below as start/end markers
  for char in ['\x08', '\x10', '~', '^']:
    text = text.replace(char, '')
  return text

all_jokes.Joke = all_jokes.Joke.apply(removeChars)

text = '\n'.join(all_jokes.Joke)

# collect the vocabulary before adding the markers, so '~' and '^' can get reserved labels later
chars = list(set(text))

# wrap every joke in a start ('~') and end ('^') marker
all_jokes.Joke = all_jokes.Joke.apply(lambda x: '~' + x + '^')
print(sorted(chars))

['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}']


Now we'll define our RNN. Keras makes this trivial; the code that builds the model appears further down, once the training data has been prepared.

We're framing our task as a classification task. Given a sequence of characters, we want to predict the next character. We equate each character with some label or category (e.g. "a" is 0, "b" is 1, etc).

We use the _softmax_ activation function on our output layer - this function is used for categorical output. It turns the output into a probability distribution over the categories (i.e. it makes the values the network outputs sum to 1). So the network will essentially tell us how strongly it feels about each character being the next one.
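As a quick illustration of what softmax does, here is a minimal NumPy sketch applied to three made-up raw output values (the numbers are arbitrary, not taken from the trained model):

```python
import numpy as np

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    shifted = logits - np.max(logits)   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])   # made-up raw network outputs for three characters
probs = softmax(logits)

print(probs)         # larger scores get larger probabilities
print(probs.sum())   # the probabilities sum to 1 (up to floating-point rounding)
```

The ordering of the scores is preserved, so the character with the highest raw output is also the most probable one.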

The categorical cross-entropy loss is the standard loss function for multiclass classification; it penalizes the network more the less probability it assigns to the correct label.
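A small sketch of that loss for a single example, assuming a one-hot target and three made-up predicted distributions (the helper name and the numbers are illustrative):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])   # the correct character is class 1

confident_right = np.array([0.05, 0.90, 0.05])
unsure = np.array([0.30, 0.40, 0.30])
confident_wrong = np.array([0.90, 0.05, 0.05])

loss_right = categorical_cross_entropy(y_true, confident_right)
loss_unsure = categorical_cross_entropy(y_true, unsure)
loss_wrong = categorical_cross_entropy(y_true, confident_wrong)

print(loss_right, loss_unsure, loss_wrong)
```

With a one-hot target the loss reduces to the negative log of the probability given to the correct class, so a confident wrong answer costs far more than an unsure one.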

We use dropout here to prevent overfitting - we don't want the network to just return things already in the text, we want it to have some wiggle room and create novelty! Dropout is a technique where, in training, some percent (here, 20%) of random neurons of the associated layer are "turned off" for that training step (a different random subset is chosen for each batch). This reduces overfitting by preventing the network from relying on particular neurons.
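The mechanism can be sketched in a few lines of NumPy. This is an illustrative "inverted dropout" function with a made-up name and input, not Keras's actual implementation. Kept activations are scaled up by `1 / (1 - rate)` during training so that nothing needs to change at prediction time.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training:
        return activations                       # dropout is a no-op at prediction time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True for units that stay on
    return activations * mask / keep_prob        # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones(1000)                      # made-up layer output

train_out = dropout(activations, rate=0.2, rng=rng, training=True)
test_out = dropout(activations, rate=0.2, rng=rng, training=False)

print((train_out == 0).mean())   # fraction of units switched off, close to 0.2
print(train_out.max())           # surviving units are scaled to 1 / 0.8
```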

That's it for the network architecture!

To train, we have to do some additional preparation. We need to chop up the text into character sequences of the length we specified (`max_len`) - these are our training inputs. We match them with the character that immediately follows each sequence. These are our expected training outputs.

For example, say we have the following text (this quote is from Zhuang Zi). With `max_len=20`, we could manually create the first couple training examples like so:
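As a sketch of that chopping step, the snippet below uses the opening of the well-known fish-trap passage attributed to Zhuang Zi as a stand-in; the exact wording of the example text is an assumption, and `make_examples` is a hypothetical helper name. Unlike the notebook's real preparation code, it does not pad the text with start markers.

```python
example_text = "The fish trap exists because of the fish."   # stand-in text (assumed wording)
max_len = 20

def make_examples(text, max_len):
    """Slide a window of max_len characters over the text.

    Each window is a training input; the character right after it is the target.
    """
    inputs, targets = [], []
    for i in range(len(text) - max_len):
        inputs.append(text[i:i + max_len])
        targets.append(text[i + max_len])
    return inputs, targets

inputs, targets = make_examples(example_text, max_len)

# the first couple of training examples
print(repr(inputs[0]), '->', repr(targets[0]))
print(repr(inputs[1]), '->', repr(targets[1]))
```

Each window overlaps the previous one by all but one character, so even a short text produces many training examples.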

We also need to map each character to a label and create a reverse mapping to use later:

In [6]:
char_labels = {ch:i+3 for i, ch in enumerate(chars)}
labels_char = {i+3:ch for i, ch in enumerate(chars)}
# Padding Char:
char_labels['PAD'] = 0
labels_char[0] = 'PAD'
# Start Char
char_labels['~'] = 1
labels_char[1] = '~'
# End Char
char_labels['^'] = 2
labels_char[2] = '^'

In [7]:
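# sanity check: the labels should be exactly 0..N-1 with no gaps or duplicates
# (nothing is printed if that holds)
z = sorted(list(char_labels.values()))
for i, k in enumerate(z):
  if i != k:
    print(i)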

In [8]:
print(char_labels)

{'q': 3, 'd': 4, 'P': 5, 'l': 6, 'J': 7, '#': 8, 'f': 9, 'i': 10, '9': 11, '?': 12, 'I': 13, '&': 14, '2': 15, ':': 16, 'c': 17, '$': 18, 'h': 19, '0': 20, 'O': 21, 'n': 22, 'm': 23, '_': 24, ',': 25, '*': 26, 'H': 27, 'Z': 28, '8': 29, '%': 30, '+': 31, '|': 32, ';': 33, 'y': 34, 'j': 35, '!': 36, '@': 37, 'X': 38, ' ': 39, 'p': 40, '3': 41, '<': 42, 'B': 43, '>': 44, '"': 45, 's': 46, 'a': 47, '7': 48, 't': 49, '/': 50, 'g': 51, 'w': 52, 'E': 53, 'b': 54, 'G': 55, 'V': 56, '1': 57, 'u': 58, ')': 59, 'o': 60, 'U': 61, '\\': 62, 'N': 63, '`': 64, 'k': 65, '-': 66, 'W': 67, 'L': 68, 'S': 69, 'v': 70, '\n': 71, '{': 72, 'C': 73, 'r': 74, 'T': 75, 'M': 76, '(': 77, '5': 78, '4': 79, 'R': 80, 'F': 81, 'e': 82, 'z': 83, 'Q': 84, "'": 85, '=': 86, 'x': 87, 'K': 88, '.': 89, ']': 90, 'D': 91, '}': 92, 'Y': 93, 'A': 94, '[': 95, '6': 96, 'PAD': 0, '~': 1, '^': 2}


Now we can start constructing our numerical input 3-tensor and output matrix. Each input example (i.e. a sequence of characters) is turned into a matrix of one-hot vectors; that is, a bunch of vectors where the index corresponding to the character is set to 1 and every other index is set to 0.
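As a small illustration of one-hot encoding, the sketch below assumes a toy four-character vocabulary (`toy_labels` is hypothetical and much smaller than the notebook's `char_labels`). Stacking the per-sequence matrices gives the 3-tensor of shape (examples, sequence length, number of classes).

```python
import numpy as np

toy_labels = {'a': 0, 'b': 1, 'c': 2, 'd': 3}   # hypothetical mapping for illustration

def one_hot_sequence(sequence, labels):
    """Encode a string as a (len(sequence), len(labels)) matrix of one-hot rows."""
    matrix = np.zeros((len(sequence), len(labels)), dtype=np.float32)
    for position, char in enumerate(sequence):
        matrix[position, labels[char]] = 1.0
    return matrix

encoded = one_hot_sequence("cab", toy_labels)
print(encoded)

# several equally long sequences stack into a 3-tensor
batch = np.stack([one_hot_sequence(s, toy_labels) for s in ["cab", "bad"]])
print(batch.shape)
```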

In [9]:
print(len(char_labels))

97


# Creating the model here


In [16]:
### Creating X Data
# max_len = 200
# x_data = np.array([np.array([char_labels[char] for char in z[:-1]]) for z in all_jokes.Joke])
# input_sequences = np.array(pad_sequences(x_data,   
#                             maxlen=max_len, padding='post'))


# y_data = np.array([np.array([char_labels[char] for char in z[1:]]) for z in all_jokes.Joke])
# true_output = np.array(pad_sequences(y_data,   
#                             maxlen=max_len, padding='post'))


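# encode every joke (including its '~' and '^' markers) as a list of integer labels
x_data = [[char_labels[char] for char in z] for z in all_jokes.Joke]
max_len=20
input_sequences = []
labels = []
for seq in tqdm(x_data[:100000]):
  # left-pad with start markers so even the first characters get a full-length context;
  # note the joke already ends in '^' (label 2), so this appends a second end marker
  seq = [1]*max_len + seq + [2]
  # slide a window of max_len characters over the joke; the target is the character that follows
  for i in range(len(seq) - max_len):
    input_sequences.append(to_categorical(seq[i:i+max_len], num_classes=len(char_labels)))
    labels.append(seq[i+max_len])
input_sequences = np.array(input_sequences)  # shape: (num_examples, max_len, num_classes)
labels = to_categorical(labels, num_classes=len(char_labels))  # shape: (num_examples, num_classes)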

100%|██████████| 300/300 [00:00<00:00, 781.31it/s]


In [12]:
input_sequences.shape

(28149, 20, 97)

In [None]:
# X = np.array([to_categorical(seq, num_classes=len(char_labels)) for seq in input_sequences]) 
# y = to_categorical(labels, num_classes=len(char_labels))

In [13]:
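model = Sequential()
# first LSTM layer returns its output at every time step so the second LSTM receives a full sequence
model.add(LSTM(512, return_sequences=True, input_shape=(max_len, len(char_labels))))
model.add(Dropout(0.2))
# second LSTM layer returns only its final output, which summarizes the whole input window
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
# one output unit per character label, turned into probabilities by softmax
model.add(Dense(len(char_labels)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])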

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 20, 512)           1249280   
_________________________________________________________________
dropout_1 (Dropout)          (None, 20, 512)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 512)               2099200   
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 97)                49761     
_________________________________________________________________
activation_1 (Activation)    (None, 97)                0         
Total params: 3,398,241
Trainable params: 3,398,241
Non-trainable params: 0
_________________________________________________________________
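The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four weight blocks (input, forget and output gates plus the candidate cell update), each with input weights, recurrent weights and a bias; the helper names are made up for illustration.

```python
def lstm_params(input_dim, units):
    """Four weight blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input/output pair plus one bias per output."""
    return input_dim * units + units

vocab_size, units = 97, 512   # len(char_labels) and the LSTM width used above

lstm_1 = lstm_params(vocab_size, units)   # first LSTM reads one-hot characters
lstm_2 = lstm_params(units, units)        # second LSTM reads the first LSTM's output
dense_1 = dense_params(units, vocab_size)
total = lstm_1 + lstm_2 + dense_1

print(lstm_1, lstm_2, dense_1, total)
```

These match the 1,249,280, 2,099,200 and 49,761 parameters reported for `lstm_1`, `lstm_2` and `dense_1`, and the total of 3,398,241. The dropout and activation layers have no parameters.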


In [17]:
model.fit(input_sequences, labels, batch_size=32, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
 3680/28149 [==>...........................] - ETA: 2:51 - loss: 2.5391 - acc: 0.3071

KeyboardInterrupt: 

In [18]:

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
  in_text = seed_text
  # generate a fixed number of characters
  for _ in range(n_chars):
    # encode the characters as integers
    encoded = [mapping[char] for char in in_text]
    # truncate sequences to a fixed length
    encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
    # one hot encode
    encoded = to_categorical(encoded, num_classes=len(mapping))
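    # predict character: take the most probable class from the softmax output
    # (predict_classes was removed from newer Keras releases, so use argmax instead)
    yhat = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]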
    # reverse map integer to character
    out_char = ''
    for char, index in mapping.items():
      if index == yhat:
        out_char = char
        break
    # append the predicted character to the input
    in_text += out_char
  return in_text

In [21]:
result = generate_seq(model, char_labels, 20, 'the man ran into the bar', 200)

In [22]:
result

'the man ran into the bare an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an an '

In [None]:
# import tensorflow as tf
# from keras.utils import to_categorical

# class CharRNN:
#   def __init__(self):
#     self.global_step = tf.train.get_or_create_global_step()
#     self.build_model()
    
    
#   def build_model(self):
#     with tf.variable_scope('Initialization'):
#       self.x = tf.placeholder(tf.float32, shape=[None, None, 97], name='Features')
#       self.y = tf.sparse_placeholder(tf.int32, name='Labels')
#       self.seq_lengths = tf.placeholder(tf.int32, shape=[None,], name='Sequence_Lengths')
      
#     with tf.variable_scope('LSTM'):
#       lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=2, 
#                                             num_units=512)
#       lstm_output, _ = lstm(tf.transpose(self.x, [1,0,2]))
      
#       logits = tf.layers.dense(lstm_output, units=97)
    
#     with tf.variable_scope('Optimization'):
#       optimizer = tf.train.AdamOptimizer(0.0001)
#       loss = tf.nn.ctc_loss(inputs=logits, labels=self.y, 
#                               sequence_length=self.seq_lengths, time_major=True)
      
#       self.loss = tf.reduce_mean(loss)
#       self.train_step = optimizer.minimize(self.loss, global_step=self.global_step)

      
# def padDataBatch(data):
#       seq_lengths = np.array([utter.shape[0] for utter in data])
#       maxlen = max(seq_lengths)
#       print(maxlen)
#       result = np.array([np.pad(utter, ((0, maxlen-utter.shape[0])), mode='constant') for utter in data])
#       return to_categorical(result, num_classes=97)
    
    
# def convert_to_sparse(labels):
#     indices, values = [], []
#     ind_x = np.zeros(3000)
#     ind_y = np.arange(3000)
#     maxlen = 0 
#     for i, label in enumerate(labels):
#       length = label.size
#       indices.append(np.stack([ind_x[:length]+i, ind_y[:length]], axis=1))
#       values.append(label)
#       maxlen = max(len(label), maxlen)
#     return np.concatenate(indices, axis=0), np.concatenate(values, axis=0), np.array([len(labels), maxlen])
    
    
# def train(sess, model, num_epochs, input_seq, seq_lengths, output_seq):
#   batch_size = 4
#   data_size = input_seq.shape[0]
#   all_idxs = np.arange(data_size)

#   for k in range(num_epochs):
#     print("Starting Epoch ", k)
#     np.random.shuffle(all_idxs)
#     i = 0
#     while i < len(all_idxs):
#       idxs = all_idxs[i:i+batch_size]
#       i += batch_size

#       batch_data = padDataBatch(input_seq[idxs])
#       batch_lens = seq_lengths[idxs]
#       batch_y = convert_to_sparse(output_seq[idxs])
#       loss = sess.run(model.loss, feed_dict={model.x:batch_data, model.y:batch_y, model.seq_lengths:batch_lens})
#       print(loss)
# #       return batch_data
        

In [None]:
# sess = tf.Session()
# model = CharRNN()
  
# a = train(sess, model, 1, x_data, seq_lengths, y_data)