# RNN for Text Generation
**Peter Kabai**

## Part 1 - Replicating Karpathy's Code

### Introduction
This project aims to replicate Karpathy's text generation code in TensorFlow using an RNN. The text that will be used is Alice in Wonderland by Lewis Carroll. To mimic the Karpathy code, each iteration will print text that's 200 characters long, and each input sequence will be 25 characters long.

### Enabling the GPU
Here we check whether a Colab GPU is available. It may not help for this particular RNN, but enabling it can't hurt.

In [2]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    print('GPU device not found')
    print('Simply select "GPU" in the Accelerator drop-down in Notebook Settings')
else:
    print('Found GPU at: {}'.format(device_name))

GPU device not found
Simply select "GPU" in the Accelerator drop-down in Notebook Settings


### Importing Alice in Wonderland
Next, the text file is downloaded. In this case, it's Alice in Wonderland. The first 225 characters are printed below.

In [3]:
import requests
url = "https://raw.githubusercontent.com/peterkabai/tensorFlow/master/textFiles/aliceInWonderland.txt"
data = requests.get(url).text
print(data[:225])

﻿
ALICE'S ADVENTURES IN WONDERLAND

Lewis Carroll

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped


Next, the total number of characters and the number of unique characters are printed. The text contains 71 unique characters. When one-hot encoding is applied later on, each character will be replaced by an array of 70 zeros and a single one, so that the character indices are treated as categorical rather than numerical values.

In [4]:
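# sorted so the character-to-index mapping is identical on every run
# (set ordering can change between Python sessions, which matters when restoring a saved model)
chars = sorted(set(data))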
data_size, vocab_size = len(data), len(chars)
print('The text has %d total characters and %d unique characters.' % (data_size, vocab_size))

The text has 147731 total characters and 71 unique characters.
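
As a rough illustration of the one-hot encoding described above, the sketch below encodes a toy string in plain Python. The names `toy_chars`, `toy_char_to_ix` and `one_hot` are made up for this example and are not part of the notebook, which uses `tf.one_hot` later on.

```python
# Toy vocabulary; the notebook's real vocabulary has 71 characters.
toy_chars = sorted(set("hello"))  # ['e', 'h', 'l', 'o']
toy_char_to_ix = {ch: i for i, ch in enumerate(toy_chars)}

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Each character becomes a vector with exactly one non-zero entry.
encoded = [one_hot(toy_char_to_ix[ch], len(toy_chars)) for ch in "hello"]
for ch, vec in zip("hello", encoded):
    print(ch, vec)
```

With the real text, each vector would have 71 entries instead of 4, which is why the input placeholder later has `vocab_size` as its last dimension.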


### Sampling Sequences
Below, the Alice in Wonderland text is split into input sequences and target sequences. X contains each 25-character input sequence, and y contains the same sequence shifted over by one character, so each position in y holds the character that follows the corresponding position in X. Both are stored as character indices rather than as the characters themselves.

In [5]:
# the number of sequential characters to sample
length_of_input = 25

# maps each character to a number and each number to a character
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# create training sequences and corresponding labels
import numpy as np
X = []
y = []
for i in range(0, len(data)-length_of_input-1, 1):
    X.append([char_to_ix[ch] for ch in data[i:i+length_of_input]])
    y.append([char_to_ix[ch] for ch in data[i+1:i+length_of_input+1]])

# reshapes the data
X_modified = np.reshape(X, (len(X), length_of_input))
y_modified = np.reshape(y, (len(y), length_of_input))
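
To make the shifted-by-one relationship between X and y concrete, here is a small sketch on a toy string. It mirrors the loop bounds used above, but `toy_text`, `toy_length` and `pairs` are illustrative names only, and the characters are left as text instead of being mapped to indices.

```python
toy_text = "hello world"
toy_length = 4  # stands in for length_of_input = 25

# Build (input, target) pairs with the same slicing as the notebook loop.
pairs = []
for i in range(0, len(toy_text) - toy_length - 1):
    inp = toy_text[i:i + toy_length]
    target = toy_text[i + 1:i + toy_length + 1]
    pairs.append((inp, target))

# Each target is its input shifted one character to the right.
for inp, target in pairs[:3]:
    print(repr(inp), "->", repr(target))
```

At every position the model is therefore trained to predict the next character given the characters seen so far in the sequence.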

The function below prints an array of character indices as characters.

In [6]:
# function to print the characters that match an array of indices
def print_chars(indexArray):
    print("---------------------------------------------")
    string = ""
    for c in indexArray:
        if c is not None:
            string += ix_to_char.get(c)
    print(string)

### Graph Setup
Here the TensorFlow graph is created. To mimic the Karpathy code, the number of iterations (characters generated per epoch) is set to 200. The length of the character sequences was already set to 25 above. Note that the code uses the TensorFlow 1.x graph API; `tf.contrib` was removed in TensorFlow 2.x.

In [13]:
tf.reset_default_graph()
import random
import warnings
warnings.filterwarnings("ignore")

# sets hyperparameters to mimic the Karpathy code
n_neurons = 50
num_batches = 5
iterations = 200
n_layers = 1
learning_rate = 0.005
sequence_to_use = 0
num_sequences = X_modified.shape[0]

# X has any num of batches and chars, and vocab_size due to one-hot encoding
X = tf.placeholder(tf.float32, [None, None, vocab_size])

# y has any num of batches, and 'length_of_input' characters
y = tf.placeholder(tf.int32, [None, length_of_input])

# RNN cell(s), output layer, loss, optimizer and softmax probabilities
layers = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu) for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
logits = tf.layers.dense(outputs, vocab_size)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
probs = tf.nn.softmax(logits)

# tf settings that are needed in both the session and result code blocks
saver = tf.train.Saver()
tf.logging.set_verbosity(tf.logging.ERROR)

# location to save the model
out_file = "./models/rnnTextGen/rnnTextGen.ckpt"
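
The loss in the graph above is a softmax cross-entropy over the vocabulary, averaged over all positions. The sketch below works through that calculation for two time steps in plain Python; it is only meant to show the arithmetic, not to reproduce TensorFlow's implementation, and all names prefixed with `toy_` are invented for the example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(softmax(logits)[label])

# Scores for a 3-character vocabulary at a single time step.
toy_logits = [2.0, 0.5, -1.0]
toy_probs = softmax(toy_logits)

# The loss is small when the correct index has a high score, large otherwise.
loss_likely = sparse_cross_entropy(toy_logits, 0)
loss_unlikely = sparse_cross_entropy(toy_logits, 2)

# Averaging over positions plays the role of tf.reduce_mean in the graph.
mean_loss = (loss_likely + loss_unlikely) / 2
print(toy_probs, loss_likely, loss_unlikely, mean_loss)
```

The labels are plain integer indices, which is why y is fed to the graph without one-hot encoding.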

*Only run this when re-starting the epoch count!*

In [8]:
# the number of epochs that have been run saves here to be restored later
epoch_count = tf.Variable(0, name='epoch_count')

### TensorFlow Session
Next, all the components are brought together. Each training step uses a batch of `num_batches` (5) consecutive 25-character sequences. Every time an iteration runs, a new character is appended to the seed, and after every epoch the resulting 200 characters are printed. However, only the results of the first `iterations_to_print` epochs (30 in the code below) are printed here. Below, one last epoch will be run to print out the final results. The first character of the 200-character output is picked randomly.

**Note:** this will run forever, and should be stopped with a keyboard interrupt. The code block after this one will then run one last epoch to print the final result.

In [22]:
iterations_to_print = 30

init = tf.global_variables_initializer()
with tf.Session() as sess:
    
    # initializes all variables, then attempts to restore the saved session if available
    init.run()
    try:
        saver.restore(sess, out_file)
        print("Session with", epoch_count.eval(), "epochs was restored...")
    except:
        init.run()
        print("New session started...")
        
    # epoch loop starts here
    while (True):
        
        # skips training and prediction once the first 'iterations_to_print' epochs are done
        if (epoch_count.eval() < iterations_to_print):
            
            # creates a seed to start the string from
            # (random.randint includes both endpoints, so the upper bound is vocab_size - 1)
            pred_indices = [random.randint(0, vocab_size - 1)]
            
            # each iteration is one new character
            for iteration in range(0, iterations):
                
                # creates batches, each row is an array of sequential characters
                in_indices = []
                out_indices = []
                for batch in range(0, num_batches):
                
                    # if we run out of sequences, the sequence to use returns to 0
                    if (sequence_to_use >= num_sequences):
                        sequence_to_use = 0
                        
                    # in and out indices are appended to
                    # 'sequence_to_use' is incremented to get the next sequence when re-run
                    in_indices.append(X_modified[sequence_to_use])
                    out_indices.append(y_modified[sequence_to_use])
                    sequence_to_use += 1
                
                # one hot encode the inputs (the outputs do not need encoding)
                X_encoded = tf.one_hot(np.asarray(in_indices), vocab_size).eval()
                
                # run the training op
                sess.run(training_op, feed_dict={X: np.asarray(X_encoded), y: np.asarray(out_indices)})
            
                # this skips the extra prediction when it won't be printed anyway
                if (epoch_count.eval() < iterations_to_print):
                
                    # one hot encode the prediction indices
                    pred_encoded = tf.one_hot(np.asarray(pred_indices), vocab_size).eval()
                
                    # get predictions as probabilities
                    predictions = sess.run(probs, feed_dict={X: np.asarray([pred_encoded])})
                
                    # take the probabilities from the last character
                    # pick the next index using the probabilities
                    ix = np.random.choice(range(vocab_size), p=(predictions[0][-1]).ravel())
                
                    # add to the array of indices
                    pred_indices.append(ix)
            
        # print the string every epoch for the first 'iterations_to_print' epochs
        if (epoch_count.eval() < iterations_to_print):
            print_chars(pred_indices)
            
        # increment the epoch count used to see when to print the results
        epoch_count = epoch_count + 1
    
        # save the session
        save_path = saver.save(sess, out_file)

New session started...
---------------------------------------------
s eineeiTw'siesohasD i thwLiulrvsa*nndrthe wreehaw dv,iFhe eeoia lldtsgD﻿ e t weeseb re hetN sg n oodrro!X, 'tt thi;ljsoftthafitl ( aheie
---------------------------------------------
 d rindtteedeeth w d buhe sibagC*lo]mer ,ud boutlalAle auy beat pe perles
y foebleky, tohe wasto waad of
---------------------------------------------
ithen conm hcd wo angtditlin
Indefs bfaw
Dying,s supn she 
dwe fodlnd thagN)gughe Rur?Oled" nced fhaulscRite houd wit focimed'r roipoul yun nulwn n., intthe bayr
---------------------------------------------
Yd-, thi seinctu dolndt me merag wo
meng ti fousolr phery tos torne marl dfangtheTie, shotr thar thisas, steE'or!otoll
hew le dout
the he
de'r aronwisenle laswras Rulily hl kift Hwnut the she nl wo 
---------------------------------------------
y.u thitkeeehas megs To Lrake I ticn
a shoe yoc NeSn und-
there berey fe Nhithata
aryrllm yat cetPy ag anatrathing(o dats-emekenfs ga
s momls t
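
The generation step in the loop above samples the next character from the predicted probabilities and appends it to the seed. The sketch below shows that idea with a hand-written stand-in for the network: `toy_model` is a hypothetical lookup table of next-character probabilities that depends only on the previous character, whereas the real RNN is fed the whole seed so far.

```python
import random

toy_vocab = ["a", "b", "c"]

# Hypothetical next-character probabilities, keyed by the previous character.
toy_model = {
    "a": [0.1, 0.8, 0.1],
    "b": [0.1, 0.1, 0.8],
    "c": [0.8, 0.1, 0.1],
}

def generate(seed_char, n_chars, rng):
    """Grow a string one character at a time by sampling from the model."""
    out = [seed_char]
    for _ in range(n_chars):
        next_probs = toy_model[out[-1]]  # distribution for the next character
        next_char = rng.choices(toy_vocab, weights=next_probs, k=1)[0]
        out.append(next_char)
    return "".join(out)

rng = random.Random(0)  # fixed seed so repeated runs give the same string
sample = generate("a", 20, rng)
print(sample)
```

Sampling from the distribution, rather than always taking the most likely character, is what lets the output vary from one run to the next.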

### Results
Since not every epoch's results were printed above, one last epoch is run below to print out the final result after training.

In [15]:
# for useful comments see code block above
with tf.Session() as sess:
    try:
        saver.restore(sess, out_file)
        print("Session was restored...")
    except:
        print("Session failed to be restored...")
        import sys
        sys.exit()
        
    pred_indices = [random.randint(0, vocab_size - 1)]
    for iteration in range(0, iterations):
        in_indices = []
        out_indices = []
        for batch in range(0, num_batches):
            if (sequence_to_use >= num_sequences):
                sequence_to_use = 0
            in_indices.append(X_modified[sequence_to_use])
            out_indices.append(y_modified[sequence_to_use])
            sequence_to_use += 1
        X_encoded = tf.one_hot(np.asarray(in_indices), vocab_size).eval()
        sess.run(training_op, feed_dict={X: np.asarray(X_encoded), y: np.asarray(out_indices)})
        pred_encoded = tf.one_hot(np.asarray(pred_indices), vocab_size).eval()
        predictions = sess.run(probs, feed_dict={X: np.asarray([pred_encoded])})
        ix = np.random.choice(range(vocab_size), p=(predictions[0][-1]).ravel())
        pred_indices.append(ix)
    print_chars(pred_indices)

Session was restored...
---------------------------------------------
t dHP.qfH(baE.OBq.aq:.mC?-:b qkHP .?fP.R(bt.Mb_[ RcCBbTHPe?[ kcoPOmfC ?.Mq :LW .k.BOe,_ eXLii aH.[ mS.R
tt(q(f at.:?l  ft-OVe TE[ te .?CjYe t
OLbinuofMqm

?inpLbut  eftLOtqikfw lhoArMH. oin oM[?fii


## Part 2 - Further Exploration with Harry Potter

### Importing the Text

In [16]:
url = "https://raw.githubusercontent.com/peterkabai/tensorFlow/master/textFiles/harryPotter.txt"
data = requests.get(url).text
print(data[:298])

CHAPTER ONE 
THE BOY WHO LIVED 
Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. 
