# Homework 5 - Text Mining Shakespeare

The goal of this project is to generate new text from various Shakespeare plays. We use Recurrent Neural Networks (RNNs) to do so. To assess the quality of our model, we compare the generated text to Shakespeare's originals.

In [2]:
# All imports
from __future__ import print_function  # future imports must come before all others

import os.path
import random
import sys
from functools import reduce

import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file

Using TensorFlow backend.


## Data

The data comes from Project Gutenberg. We downloaded the following books manually from the [Gutenberg website](https://www.gutenberg.org/ebooks/search/?query=shakespeare):

Title | ID
------|----
Romeo and Juliet | 1112
Hamlet | 1524
Macbeth | 2264
Midsummer Night's Dream | 2242
The Tempest | 23042
Othello | 2267
The Tragedy Of Julius Caesar | 1120
The Tragedy Of King Lear | 1128
Twelfth Night; Or, What You Will | 1526
As You Like It | 1121
The Taming Of The Shrew | 1107
Henry V | 2253
King Richard III | 1103

The raw text files are stored in the subfolder `./data` of this project. We first write a helper class to read a single text, encode the text as an integer array and translate an integer array back to text.

In [51]:
class shakespeare_text():
    """
    Contains all data and methods for a given Shakespeare text.
    """
    def __init__(self, id):
        #Contains entire text of book
        self.text = self.__read_text_(id)
        
        self.text_length = len(self.text)
        #Characters in book
        self.vocab = sorted(set(self.text))
        #Dictionary of character to index mapping
        self.vocab_to_int = {c: i for i, c in enumerate(self.vocab)}
        #Number of distinct characters in book (vocabulary size)
        self.vocab_length = len(self.vocab)
        #Dictionary of index to character mapping
        self.int_to_vocab = dict(enumerate(self.vocab))
        #Characters in text mapped to indexes as numpy array
        self.array = np.array([self.vocab_to_int[c] for c in self.text], dtype=np.int32)
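        #Length (in characters) of each input sequence fed to the LSTM;
        #despite its name, this is not the mini-batch size passed to model.fit
        self.batch_size = 40
        #Offset (in characters) between the starts of consecutive input sequences
        self.step_size = 3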


    def __read_text_(self, id):
        """
        Reads text corresponding to one of the ids.

        :param id: string
        """
        text_path = os.path.join('data', 'pg' + id + '.txt')
        with open(text_path) as file:
            txt = file.read()
            
        #Contains start and end index of the play text for each book
        #Used to remove copyright information from the text
        pos_dict = {  '1112':[1799,159841],
                      '1524':[11818,192814],
                      '2264':[16241,119813],
                      '2242':[15880,112106],
                     '23042':[2230,142976],
                      '2267':[15926,171448],
                      '1120':[9030,140961],
                      '1128':[9033,185348],
                      '1526':[11920,131682],
                      '1121':[2360,142514],
                      '1107':[9030,148671],
                      '2253':[11860,170304],
                      '1103':[1950,200275] }
        
        return txt[pos_dict[id][0]:pos_dict[id][1]]
    
    def batches_gen(self, batch_size, seq_size):
        """
        Batch generator that returns batches of size
        batch_size x seq_size from array.

        :param batch_size: Batch size, the number of sequences per batch
        :param seq_size: Number of characters per sequence.
        """
        # Get the number of characters per batch and number of batches we can make
        characters_per_batch = batch_size * seq_size
        n_batches = len(self.array) // characters_per_batch

        # Keep only enough characters to make full batches
        arr = self.array[:characters_per_batch * n_batches]

        # Reshape into batch_size rows
        arr = arr.reshape((batch_size, -1))

        for n in range(0, arr.shape[1], seq_size):
            # The features: a window of seq_size characters per row
            x = arr[:, n:(n+seq_size)]
            # The targets, shifted by one
            y = np.zeros_like(x)
            y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
            yield x, y
            
    def int_to_text(self, array):
        """
        Converts an integer array to the corresponding text.
        
        :param array: numpy array of integers with values in the vocabulary
        """
        # Look up the character for each index and join them into one string
        return ''.join(self.int_to_vocab[i] for i in array.tolist())
    
    def print(self):
        print(self.text)
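
To make the encoding step concrete, here is a minimal sketch of the character-to-integer round trip on a toy string. It mirrors the dictionaries built in `shakespeare_text.__init__`, but the sample string is an assumption for illustration and plain lists stand in for the NumPy array.

```python
# Toy stand-in for the text of a play (illustrative only).
text = "to be or not to be"

# Same construction as in shakespeare_text.__init__
vocab = sorted(set(text))
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))

# Encode the text as a list of integers, then decode it again.
encoded = [vocab_to_int[c] for c in text]
decoded = ''.join(int_to_vocab[i] for i in encoded)

print(vocab)            # the distinct characters, sorted
print(encoded[:5])      # integer codes of the first five characters
print(decoded == text)  # the round trip should reproduce the input
```

Because the vocabulary is built from the text itself, every character is guaranteed to have an index, so decoding an encoded text always returns the original string.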
    

# Train generator

Now we can use the class written above to read in the Shakespeare plays. Then, for each play we train a separate RNN and use it to generate new text.

For reusability, the script is split into a `train` function and a `generate` function. The `train` function saves the trained model to disk so that it can be reused later.


In [52]:

def train(id,obj,iterations=80):
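    #Cut the text into overlapping sequences of obj.batch_size characters,
    #each paired with the character that immediately follows it
    sentences = []
    next_chars = []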
    for i in range(0, obj.text_length - obj.batch_size, obj.step_size):
        sentences.append(obj.text[i: i + obj.batch_size])
        next_chars.append(obj.text[i + obj.batch_size])

    #One-hot encode the input sequences (x) and their next characters (y)
    x = np.zeros((len(sentences), obj.batch_size, obj.vocab_length), dtype=bool)
    y = np.zeros((len(sentences), obj.vocab_length), dtype=bool)
    for i, sentence in enumerate(sentences):
        for t, char in enumerate(sentence):
            x[i, t, obj.vocab_to_int[char]] = 1
        #One target per sequence, so set it once outside the inner loop
        y[i, obj.vocab_to_int[next_chars[i]]] = 1
    
    #Define the LSTM model
    #Long-Short Term Memory layer - Hochreiter 1997 with tanh activation function
    model = Sequential()
    model.add(LSTM(128, input_shape=(obj.batch_size, obj.vocab_length)))
    model.add(Dense(obj.vocab_length))
    model.add(Activation('softmax'))
    optimizer = RMSprop(lr=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    
    # train the model; note that range(1, iterations) runs iterations - 1 passes
    for iteration in range(1, iterations):
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(x, y,
                  batch_size=128,
                  epochs=1)
    
    #Save model for future use
    model.save('model_' + id + '.h5') 
    return model
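
The data preparation at the top of `train` can be hard to picture at full size. The following sketch runs the same sliding-window logic on a toy string. The short window length, the sample text and the `one_hot` helper are assumptions made for illustration, and nested lists stand in for the NumPy arrays.

```python
text = "hello world"
seq_len = 4  # plays the role of obj.batch_size (40 in the notebook)
step = 3     # plays the role of obj.step_size

vocab = sorted(set(text))
vocab_to_int = {c: i for i, c in enumerate(vocab)}

# Slide a window over the text; each window is paired with the next character.
windows = []
next_chars = []
for i in range(0, len(text) - seq_len, step):
    windows.append(text[i:i + seq_len])
    next_chars.append(text[i + seq_len])

def one_hot(index, size):
    """Hypothetical helper: a list of zeros with a single 1 at `index`."""
    return [1 if k == index else 0 for k in range(size)]

# x has shape (n_windows, seq_len, vocab size); y has shape (n_windows, vocab size).
x = [[one_hot(vocab_to_int[c], len(vocab)) for c in w] for w in windows]
y = [one_hot(vocab_to_int[c], len(vocab)) for c in next_chars]

print(windows)
print(next_chars)
print(len(x), len(x[0]), len(x[0][0]))
```

With a step smaller than the window length, consecutive windows overlap, which yields more training examples from the same text.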
    


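The `generate` function starts at a random position in the text and uses the 40 characters that follow as a "seed". It then generates four sequences, one for each sampling temperature ("diversity") in 0.2, 0.5, 1.0 and 1.2, which changes the multinomial distribution the next character is sampled from. Each generated sequence is 2000 characters long.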

In [64]:


def generate(id,obj,model):

    def sample(preds, temperature=1.0):
        # helper function to sample an index from a probability array
        preds = np.asarray(preds).astype('float64')
        preds = np.log(preds) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        probas = np.random.multinomial(1, preds, 1)
        return np.argmax(probas)
    
    start_index = random.randint(0, obj.text_length - obj.batch_size - 1)
    gen_text = ''
    
    # generate 4 sequences of 2000 characters 
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        generated = ''
        sentence = obj.text[start_index: start_index + obj.batch_size]
        generated += sentence
              
        for i in range(2000):
            x_pred = np.zeros((1, obj.batch_size, obj.vocab_length))
            for t, char in enumerate(sentence):
                x_pred[0, t, obj.vocab_to_int[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = obj.int_to_vocab[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char
            
        gen_text+=('\n'+generated)
    return gen_text
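
To see what the diversity values do, the sketch below applies the same rescaling as the `sample` helper to a small, made-up probability vector. The function name `apply_temperature` and the example probabilities are assumptions for illustration. The sampling step itself is left out so that the result is deterministic.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability vector the way the sample helper does:
    take logs, divide by the temperature, exponentiate, renormalise."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Made-up next-character probabilities (illustrative only).
probs = [0.6, 0.3, 0.1]

for temperature in [0.2, 0.5, 1.0, 1.2]:
    rescaled = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

Temperatures below 1 concentrate the probability mass on the most likely character, a temperature of 1 leaves the distribution unchanged, and temperatures above 1 flatten it, so rarer characters are sampled more often.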

Finally, we write the generated text to an external file for testing. The files we generated come from models trained for 80 epochs, by which point the log loss was around 0.6. Identifiable word structure started to appear after around 40 epochs.

In [66]:
def write_shakespeare(id):
    #instantiate object to pass to subsequent functions
    shakespeare = shakespeare_text(id)
    
    #Setting to only one iteration for the final compilation:
    #iterations=2 gives a single pass because train loops over range(1, iterations)
    #trm contains the trained model. To perform in batch use load_model
    trm = train(id,shakespeare,iterations=2)
    
    #Write generated text to external file and close the file afterwards
    with open("./gen_data/gen_" + id + "x.txt", "w") as out_file:
        print(generate(id,shakespeare,trm), file=out_file)



write_shakespeare('1112')
#write_shakespeare('1524')
#write_shakespeare('2264')
#write_shakespeare('2242')
#write_shakespeare('23042')
#write_shakespeare('2267')
#write_shakespeare('1120')
#write_shakespeare('1128')
#write_shakespeare('1526')
#write_shakespeare('1121')
#write_shakespeare('1107')
#write_shakespeare('2253')
#write_shakespeare('1103')


--------------------------------------------------
Iteration 1
Epoch 1/1


Here is a short excerpt of text generated from Julius Caesar:

"Exit Brutus.
    I dvey, ware you all your hands and stinn.
  CASSIUS. Most noble be some stay athy leess.
    I had rey bears at a seeator and foan
    To what thou somens.
  BRUTUS. I comes to but our hands and that I do not day
    Oven men are and yet he was ambitious, Brutus.
    Stale you is come to you, if words if Malker Cassius
worse."

We can see that the LSTM has learned to identify the speaking characters and that each line of dialogue should begin with a capitalized name.
Basic sentence structure has also been learnt: sentences start with capital letters and end with full stops.
However, Brutus exits before delivering his line, so some contextual cues still need to be learned.
Brutus also appears to be talking to himself in this scene.

So, although far from perfect, our generator does produce somewhat passable Shakespeare to the human eye, provided that you don't look too deep.







# Train discriminator

We use an RNN to discriminate between the Shakespeare plays. The goal is to have a classification model based on sequences of text with the target variables corresponding to the different plays.
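
The notebook does not show how the training data for the discriminator is built. One possible way, sketched here under that assumption, is to cut every play into fixed-length character sequences and label each sequence with the index of the play it came from. The function name `labelled_windows`, the window parameters and the toy texts are all hypothetical.

```python
def labelled_windows(texts, seq_len, step):
    """Cut each text into sequences of seq_len characters and label
    every sequence with the index of the text it was taken from."""
    samples, labels = [], []
    for label, text in enumerate(texts):
        for i in range(0, len(text) - seq_len + 1, step):
            samples.append(text[i:i + seq_len])
            labels.append(label)
    return samples, labels

# Two toy "plays" (illustrative only).
texts = ["abcdefgh", "stuvwxyz"]
samples, labels = labelled_windows(texts, seq_len=4, step=2)

print(samples)
print(labels)
```

The sequences could then be one-hot encoded in the same way as the generator's input, with the labels serving as the classification targets.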

# Evaluation of generated text

## Evaluation with perplexity score

We first evaluate our generative model using the **perplexity score**. To do so, we compute the perplexity score of each of our models on the original dataset, i.e. we produce a matrix where row $i$ corresponds to the model trained on play $i$, column $j$ corresponds to the text of play $j$, and entry $(i, j)$ is the perplexity score of model $i$ evaluated on text $j$.

Next, we construct the same matrix, but this time we evaluate the models not on the original plays, but on the generated text. This gives another matrix of perplexity scores. If our generative model worked well, then the two matrices should look relatively similar.
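
The text does not spell out how the perplexity score is computed. A minimal sketch, assuming the usual character-level definition (the exponential of the average negative log-probability that a model assigns to each character that actually occurs), could look as follows. The function name `perplexity` and the toy probabilities are illustrative and not part of the original notebook.

```python
import math

def perplexity(true_char_probs):
    """Perplexity from the probabilities a model assigned to the
    characters that actually occurred in the evaluated text."""
    n = len(true_char_probs)
    avg_neg_log = -sum(math.log(p) for p in true_char_probs) / n
    return math.exp(avg_neg_log)

# A model that is unsure between 4 equally likely characters at every step.
uniform = [0.25] * 10
# A model that gives the correct character probability 0.9 at every step.
confident = [0.9] * 10

print(perplexity(uniform))
print(perplexity(confident))
```

Under this definition, entry $(i, j)$ of each matrix would be `perplexity` applied to the probabilities that model $i$ assigns to the characters of text $j$. Lower values mean the model finds the text less surprising.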

## Evaluation with discriminator

Next, we can use the trained discriminator to evaluate the generated text. If our text generation model is good, the discriminator should reach an accuracy on the generated text similar to its accuracy on the original plays. It should be neither noticeably better nor worse.

We also show the confusion matrix. This should also be similar to the one from the original plays.
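
As a rough illustration of this comparison, the sketch below builds a confusion matrix and the corresponding accuracy from true and predicted play labels. The helper names and the toy labels are assumptions for illustration, not results from our models.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are the true plays, columns the predicted plays."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Share of sequences on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy labels for three plays (illustrative only).
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(true_labels, predicted_labels, n_classes=3)
for row in cm:
    print(row)
print(accuracy(cm))
```

Running the same two functions once on sequences from the original plays and once on generated sequences gives the two matrices and accuracies to compare.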