For a primer on text-generating LSTMs, [this](https://towardsdatascience.com/lstm-by-example-using-tensorflow-feb0c1968537) post does a good job of introducing the code using `TensorFlow`.

In [1]:
import sys
sys.path.append("..")

import tensorflow as tf
import nltk
import random
import datetime

from data_encoders.text_encoder import EncodingProletariat

### Data
Again, we are going to use Jane Austen's *Emma* as our first dataset. 

In [2]:
print(nltk.corpus.gutenberg.fileids())
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


### Parameter Setting 
There are a couple of parameters that will be set as variables throughout this notebook. These will eventually turn into class parameters.

In [3]:
num_inputs = 3    # number of input columns
batch_size = 100  # the number of rows per batch


We need to pull in the `EncodingProletariat` class as this is the data feeder of the neural network.

In [4]:
ep = EncodingProletariat(emma, num_inputs=num_inputs)
ep.encodings_x

array([[3144,  107,  648],
       [ 107,  648, 4509],
       [ 648, 4509, 2023],
       ...,
       [ 297, 2030, 1356],
       [2030, 1356, 7267],
       [1356, 7267,  137]])
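
The overlapping rows in the output above suggest that each input row is a sliding window of `num_inputs` consecutive word indices. Here is a minimal sketch of that idea in plain Python. `build_windows` is a hypothetical helper, not the actual `EncodingProletariat` implementation, and the way indices are assigned (sorted order of unique tokens) is an assumption made for illustration only.

```python
def build_windows(tokens, num_inputs):
    """Map tokens to integer ids and slide a window of num_inputs over them."""
    # Assumption: ids follow the sorted order of the unique tokens.
    vocab_dict = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
    ids = [vocab_dict[word] for word in tokens]

    xs, ys = [], []
    for start in range(len(ids) - num_inputs):
        xs.append(ids[start:start + num_inputs])  # num_inputs context words
        ys.append(ids[start + num_inputs])        # the word that follows them
    return vocab_dict, xs, ys


tokens = "emma woodhouse handsome clever and rich".split()
vocab_dict, xs, ys = build_windows(tokens, num_inputs=3)
print(xs)  # [[2, 5, 3], [5, 3, 1], [3, 1, 0]]
print(ys)  # [1, 0, 4]
```

Each row shares `num_inputs - 1` indices with the row before it, which matches the pattern visible in `ep.encodings_x`.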

The `num_classes` parameter is simply the size of the vocabulary.

In [5]:
num_classes = len(ep.vocab_dict)

### Building the Graph
The next step is to start building out the graph. The `inputs` tensor is a 2-dimensional tensor of size `batch_size` by `num_inputs`. The `targets` tensor is a 2-dimensional tensor of size `batch_size` by `num_classes`, where `num_classes` is the size of the vocabulary. Each row of `targets` is a one-hot encoded vector: a `1` at the target word's index in the vocabulary dictionary and `0`s elsewhere.
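
To make the one-hot layout concrete, here is a small sketch in plain Python. The helper `one_hot` and the toy values are assumptions for illustration; they are not part of `EncodingProletariat`.

```python
def one_hot(target_ids, num_classes):
    """Turn a list of word indices into rows with a single 1.0 each."""
    rows = []
    for idx in target_ids:
        row = [0.0] * num_classes  # one slot per vocabulary entry
        row[idx] = 1.0             # mark the target word's index
        rows.append(row)
    return rows


# Three targets drawn from a toy vocabulary of six words.
targets_batch = one_hot([1, 0, 4], num_classes=6)
for row in targets_batch:
    print(row)
```

With the real data, each batch of targets would have `batch_size` rows and `num_classes` columns, matching the shape of the `targets` placeholder.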

In [6]:
inputs = tf.placeholder(tf.int32, [batch_size, num_inputs], name="input")
targets = tf.placeholder(tf.float16, [batch_size, num_classes], name="target")

##### Word Embeddings
Right now, the inputs are represented as encoded arrays where each word is represented by a number pointing to the index of the word in a vocab dictionary (`EncodingProletariat.vocab_dict`). The next function creates an embedding vector of length `embed_dim` for each word in the vocabulary and returns the embeddings for the given inputs.

In [7]:
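def embed(inputs, batch_size, num_inputs, embed_dim, vocab_size):
    # batch_size and num_inputs are not used here; the output shape follows `inputs`.
    with tf.name_scope("word_embeddings"):
        # One trainable row of length embed_dim per vocabulary entry, initialised uniformly in [-1, 1).
        word_vectors = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1, 1, seed=datetime.datetime.now().microsecond))
        # Replace every word index in `inputs` with its embedding row.
        embeddings = tf.nn.embedding_lookup(word_vectors, inputs)

    return embeddings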

In [8]:
embed(ep.encodings_x[0:100], batch_size=batch_size, num_inputs=num_inputs, embed_dim=10, vocab_size=num_classes)

<tf.Tensor 'word_embeddings/embedding_lookup/Identity:0' shape=(100, 3, 10) dtype=float32>
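
The output shape `(100, 3, 10)` follows from the fact that an embedding lookup is row indexing: every word index is replaced by the matching row of the embedding matrix. The sketch below shows this in plain Python with toy sizes. The helper name and the values are assumptions for illustration, and it is not a substitute for `tf.nn.embedding_lookup`.

```python
import random


def embedding_lookup(word_vectors, batch):
    """Replace each word index in the batch with its embedding row."""
    return [[word_vectors[idx] for idx in row] for row in batch]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab_size, embed_dim = 6, 4

# One row of embed_dim values in [-1, 1] per vocabulary entry.
word_vectors = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
                for _ in range(vocab_size)]

batch = [[2, 5, 3], [5, 3, 1]]  # 2 rows of num_inputs = 3 word indices
embedded = embedding_lookup(word_vectors, batch)

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 3, 4)
```

The result has one extra trailing dimension of size `embed_dim`, the same way a `(100, 3)` batch becomes `(100, 3, 10)` above.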

In [9]:
def build_cell(hidden_dim, num_layers, dropout):
    # Note: `dropout` is passed to DropoutWrapper as a keep probability (1.0 disables dropout).
    cells = []
    for _ in range(num_layers):
        cell = tf.contrib.rnn.LSTMCell(hidden_dim)
        dropout_cell = tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=dropout, output_keep_prob=dropout)
        cells.append(dropout_cell)
    # Stack the layers only after all of them have been created.
    return tf.contrib.rnn.MultiRNNCell(cells)

In [10]:
build_cell(10, 2, .5)

<tensorflow.python.ops.rnn_cell_impl.MultiRNNCell at 0x12977d2b0>