## Section 0: Justin Chan

## Section 1: Text Generation using RNNs

## Section 2: Project Definition

### Goals

The goal of the project is to develop a recurrent neural network (RNN) capable of generating text.

### Dataset

The data source used to train the RNN is "A Tale of Two Cities" by Charles Dickens.
It was obtained from Project Gutenberg, [here](https://www.gutenberg.org/files/98/98-0.txt).


We reference the following coding examples to build our RNN.

- [Creating A Text Generator Using Recurrent Neural Network, by Trung Tran](https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/)

- [Chun ML's GitHub repo on text generation](https://github.com/ChunML/text-generator)

- [The Unreasonable Effectiveness of Recurrent Neural Networks, by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

Karpathy's article shows how character-by-character training and generation can produce varied output, including C code and Shakespeare-style text.

### Tasks

We perform the following tasks:

1. Download the data.
2. Process the data so it is suitable for input into an RNN.
3. Train a recurrent neural network (LSTM) using Keras.
4. Generate examples from the model.

## Section 3: Data Engineering

We perform the following data-related tasks:

* Manually remove irrelevant data at the beginning of the text file (e.g. copyright information).
* Import the training text.
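
The manual clean-up step could also be scripted. The following is a minimal sketch, not the approach used in this notebook: it assumes the downloaded file marks the start and end of the book with lines beginning `*** START OF` and `*** END OF`, and the function name `strip_gutenberg_boilerplate` is hypothetical. If either marker is missing, the text is returned unchanged.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end marker lines.

    Assumption: the file contains lines beginning with '*** START OF'
    and '*** END OF'. If either is missing, the input is returned as-is.
    """
    lines = text.splitlines(keepends=True)
    start = end = None
    for i, line in enumerate(lines):
        if start is None and line.startswith("*** START OF"):
            start = i
        elif line.startswith("*** END OF"):
            end = i
            break
    if start is None or end is None or end <= start:
        return text
    # Keep only the lines strictly between the two markers.
    return "".join(lines[start + 1:end]).strip("\n") + "\n"


# Small made-up sample standing in for the real file.
sample = (
    "Title page and licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "\n"
    "It was the best of times,\n"
    "it was the worst of times,\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Licence text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Any front matter inside the markers (title page, table of contents) would still need to be removed by hand.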

In [27]:
#import libraries
import numpy as np
import matplotlib.pyplot as plt
import time
import csv
from keras.models import Sequential, load_model
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM, SimpleRNN
from keras.layers.wrappers import TimeDistributed

In [2]:
#initializing some variables
DATA_DIR = "a_tale_of_two_cities.txt"
BATCH_SIZE = 50
HIDDEN_DIM = 500
SEQ_LENGTH = 50
WEIGHTS = ''
GENERATE_LENGTH = 500
LAYER_NUM = 2  #number of stacked LSTM layers

### Read in the data

We manually remove some irrelevant information (i.e. the Project Gutenberg copyright notice) before importing the text.

In [4]:
#read the whole text; the with-block closes the file afterwards
with open(DATA_DIR, 'r', encoding='utf8') as f:
    data = f.read()
chars = list(set(data))
VOCAB_SIZE = len(chars)

## Section 4: Feature Engineering

We create conversion functions and generate the X, y datasets for training.

### Create conversion functions to deal with characters

In [5]:
#conversion functions
ix_to_char = {ix:char for ix, char in enumerate(chars)}
char_to_ix = {char:ix for ix, char in enumerate(chars)}
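
As an illustration of what the two dictionaries do, the sketch below encodes a short sample string to indices and decodes it back. The sample string and the use of `sorted` are assumptions made for this example. The notebook uses `list(set(data))`, whose order can differ between Python sessions because string hashing is randomized by default, so a sorted vocabulary is one way to keep the mapping stable when checkpoints are reloaded in a new session.

```python
sample_text = "it was the best of times"

# Sorted so the index assigned to each character is the same on every run.
chars = sorted(set(sample_text))
ix_to_char = {ix: char for ix, char in enumerate(chars)}
char_to_ix = {char: ix for ix, char in enumerate(chars)}

# Encode the text as a list of integer indices, then decode it again.
encoded = [char_to_ix[c] for c in sample_text]
decoded = "".join(ix_to_char[ix] for ix in encoded)

print(encoded)
print(decoded)
```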

In [8]:
# Create X, y

#number of full sequences; one character is held back so that the last y block is complete
NUM_SEQUENCES = (len(data) - 1)//SEQ_LENGTH

#initialize with zero first, but of correct dimensions
X = np.zeros((NUM_SEQUENCES, SEQ_LENGTH, VOCAB_SIZE))
y = np.zeros((NUM_SEQUENCES, SEQ_LENGTH, VOCAB_SIZE))


#do this for each input block
for i in range(0, NUM_SEQUENCES):
    X_sequence = data[i*SEQ_LENGTH:(i+1)*SEQ_LENGTH]
    X_sequence_ix = [char_to_ix[value] for value in X_sequence]
    input_sequence = np.zeros((SEQ_LENGTH, VOCAB_SIZE))
    for j in range(SEQ_LENGTH):
        input_sequence[j][X_sequence_ix[j]] = 1.
    X[i] = input_sequence

    #the y block is the same sequence shifted one character ahead
    y_sequence = data[i*SEQ_LENGTH+1:(i+1)*SEQ_LENGTH+1]
    y_sequence_ix = [char_to_ix[value] for value in y_sequence]
    target_sequence = np.zeros((SEQ_LENGTH, VOCAB_SIZE))
    for j in range(SEQ_LENGTH):
        target_sequence[j][y_sequence_ix[j]] = 1.
    y[i] = target_sequence
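
The same encoding can be written more compactly with NumPy indexing. This is a sketch under assumptions rather than the notebook's code: the helper name `one_hot_sequences` and the toy string are made up for illustration. It shows that each target sequence is the input sequence shifted one character ahead.

```python
import numpy as np


def one_hot_sequences(indices, seq_length, vocab_size):
    """Build one-hot (X, y) arrays where y is X shifted by one character."""
    indices = np.asarray(indices)
    # Hold back one character so the last target sequence is full length.
    n_seq = (len(indices) - 1) // seq_length
    eye = np.eye(vocab_size)
    x_ix = indices[:n_seq * seq_length].reshape(n_seq, seq_length)
    y_ix = indices[1:n_seq * seq_length + 1].reshape(n_seq, seq_length)
    # Indexing the identity matrix with integer arrays yields one-hot rows.
    return eye[x_ix], eye[y_ix]


toy_text = "hello world!"
toy_chars = sorted(set(toy_text))
toy_char_to_ix = {c: i for i, c in enumerate(toy_chars)}
toy_indices = [toy_char_to_ix[c] for c in toy_text]

X_toy, y_toy = one_hot_sequences(toy_indices, seq_length=3, vocab_size=len(toy_chars))
print(X_toy.shape, y_toy.shape)
```

Taking `argmax` over the last axis recovers the original indices, which is a quick way to check the arrays.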

## Section 5: Model Engineering

For model engineering:
* The RNN is an LSTM network with two layers (the number of layers is set by LAYER_NUM).
* After each epoch, the model generates an output of length 500 (specified by GENERATE_LENGTH). This allows us to observe the quality and any anomalies related to the generated text.

### Create a function that generates text from a given RNN model

In [22]:
def generate_text(model, length, vocab_size, ix_to_char):
    # starting with random character
    ix = [np.random.randint(vocab_size)]
    y_char = [ix_to_char[ix[-1]]]
    X = np.zeros((1, length, vocab_size))
    for i in range(length):
        # appending the last predicted character to sequence
        X[0, i, :][ix[-1]] = 1
        print(ix_to_char[ix[-1]], end="")
        ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
        y_char.append(ix_to_char[ix[-1]])
    return ('').join(y_char)
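
`generate_text` always picks the most probable next character (`np.argmax`), which is deterministic after the first random character. A common alternative, discussed in the Karpathy article, is to sample from the predicted distribution with a temperature. The sketch below is an assumption-laden illustration and not part of the notebook: the helper name `sample_index` and the example probabilities are hypothetical.

```python
import numpy as np


def sample_index(probs, temperature=1.0, rng=None):
    """Draw one index from `probs` after temperature scaling.

    Low temperatures approach argmax; higher ones give more varied output.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # Work in log space; the small constant guards against log(0).
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # subtract the maximum for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


# Hypothetical softmax output over a three-character vocabulary.
probs = [0.7, 0.2, 0.1]
rng = np.random.default_rng(0)
draws = [sample_index(probs, temperature=1.0, rng=rng) for _ in range(1000)]
cold_draws = [sample_index(probs, temperature=0.01, rng=rng) for _ in range(50)]
print(draws[:20])
print(cold_draws[:20])
```

Inside `generate_text`, such a helper would be applied to the distribution for the last time step of the model's prediction instead of taking `np.argmax`.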

### Specify and compile the model

In [14]:
model = Sequential()
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), return_sequences=True))
for i in range(LAYER_NUM - 1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(TimeDistributed(Dense(VOCAB_SIZE)))
model.add(Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

In [32]:
model.save('my_model.h5')

In [33]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, None, 500)         1168000   
_________________________________________________________________
lstm_4 (LSTM)                (None, None, 500)         2002000   
_________________________________________________________________
time_distributed_2 (TimeDist (None, None, 83)          41583     
_________________________________________________________________
activation_2 (Activation)    (None, None, 83)          0         
Total params: 3,211,583
Trainable params: 3,211,583
Non-trainable params: 0
_________________________________________________________________
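
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four sets of weights (input gate, forget gate, cell candidate, output gate), each with an input matrix, a recurrent matrix and a bias. The sketch below reproduces the numbers; the helper names are made up for this example, and `VOCAB_SIZE = 83` is read off the output shape in the summary.

```python
def lstm_param_count(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    """One weight matrix plus one bias vector."""
    return input_dim * units + units


VOCAB_SIZE = 83   # from the output shape of the last layer in the summary
HIDDEN_DIM = 500

lstm_1 = lstm_param_count(VOCAB_SIZE, HIDDEN_DIM)
lstm_2 = lstm_param_count(HIDDEN_DIM, HIDDEN_DIM)
dense = dense_param_count(HIDDEN_DIM, VOCAB_SIZE)
total = lstm_1 + lstm_2 + dense

print(lstm_1, lstm_2, dense, total)  # 1168000 2002000 41583 3211583
```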


### Run the model, saving a checkpoint and generating text from the current model after each epoch

In [39]:
nb_epoch = 43
while nb_epoch<46:
    print('\n\n')
    #train for one more epoch ('epochs' replaces the deprecated 'nb_epoch' argument in Keras 2)
    model.fit(X, y, batch_size=BATCH_SIZE, verbose=1, epochs=1)
    nb_epoch += 1
    #print a sample of text from the current model
    generate_text(model, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)
    #save a checkpoint after every epoch
    model.save_weights('checkpoint_{}_epoch_{}.hdf5'.format(HIDDEN_DIM, nb_epoch))




Epoch 1/1



“I have delivered that letter,” said Carton, “I wish we might be friend, I speak of see. A moment that more possible to us; if he was in no other tone, or office had been out of their domestic streets. I had had a pulpail docifes for that word!”

“Come, then, my wife,” said Defarge. He looked up, and walked to and fro, and the clocks struck the
number and dark his work.

The crowd was there, and the night broke of its room, and said, “I am not surely equap.”

“The timid hands he was presused t


Epoch 1/1
--a pity to
live no better life?”

“God knows it is stronger, who had been drinking in at the gate, struck and mild, against a rating smile that he could so depositate, instead of watchful sound of wrong, in an affection to be registered. If the bell at Lucie and the hollow revolucte of mind, Mr.
Lorry burnt, and it was so strong and fied, and the griedy that had been taken from his port, and took her forehead on his sake, by the breast of a stander of boar in advance, the day were 

## Section 6: Evaluate Metrics

We trained the model for more than 40 epochs (i.e. the initial few epochs were overwritten and not saved).

We load the model weights saved at various stages and compare the generated text below.

In [34]:
### Model after epoch 1
model_epoch1 = load_model('my_model.h5')
model_epoch1.load_weights('checkpoint_500_epoch_1.hdf5')
print("Text Generation from Model after 1st epoch: ")
generate_text(model_epoch1, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)

Text Generation from Model after 1st epoch: 
F the prisoner had been a little consideration and the sun was a change of the courtyard and the sun was a little child and the sun with his hands in the streets.

“I don't know what is that this is a little minute of the prisoner when I will go to the prisoner in the courtyard that I have been a little morning, that I will not be and the prisoner in the courtyard that I have been a little morning, that I will not be and the prisoner in the courtyard that I have been a little morning, that I wil

"F the prisoner had been a little consideration and the sun was a change of the courtyard and the sun was a little child and the sun with his hands in the streets.\n\n“I don't know what is that this is a little minute of the prisoner when I will go to the prisoner in the courtyard that I have been a little morning, that I will not be and the prisoner in the courtyard that I have been a little morning, that I will not be and the prisoner in the courtyard that I have been a little morning, that I will"

In [35]:
### Model after epoch 10
model_epoch10 = load_model('my_model.h5')
model_epoch10.load_weights('checkpoint_500_epoch_10.hdf5')
print("Text Generation from Model after 10th epoch: ")
generate_text(model_epoch10, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)

Text Generation from Model after 10th epoch: 
_ he said it with the last side, burnt on the stairs and close to himself, and was soon with the air. He had not been seen of the words were by the first to see the window, and she sat up again at the bench of his work, that he would not be
released and drawn to the black brother, as the fountain at all a stair and fancy, and was drawing him and her hand on his bench and the chance flowed on her breast, as if they were at seat, and was soon the little counter, and showed itself in his breast of 

'_ he said it with the last side, burnt on the stairs and close to himself, and was soon with the air. He had not been seen of the words were by the first to see the window, and she sat up again at the bench of his work, that he would not be\nreleased and drawn to the black brother, as the fountain at all a stair and fancy, and was drawing him and her hand on his bench and the chance flowed on her breast, as if they were at seat, and was soon the little counter, and showed itself in his breast of t'

In [36]:
### Model after epoch 20
model_epoch20 = load_model('my_model.h5')
model_epoch20.load_weights('checkpoint_500_epoch_20.hdf5')
print("Text Generation from Model after 20th epoch: ")
generate_text(model_epoch20, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)

Text Generation from Model after 20th epoch: 
Lorry was a pleasant sight too,
beaming at all this on the body as stopped by a low before it, and might have been a solemn interest in the eyes, and the clock struck at him, the first cause of the stairs were as tention of the life of her life and hope they were always with his flopping like a fireness to the days before the family, and a tender pass of the house for the little
counter, as if the airsed of a flamin will with a rady groun, and shine of the landscape, turned for the worse to the 

'Lorry was a pleasant sight too,\nbeaming at all this on the body as stopped by a low before it, and might have been a solemn interest in the eyes, and the clock struck at him, the first cause of the stairs were as tention of the life of her life and hope they were always with his flopping like a fireness to the days before the family, and a tender pass of the house for the little\ncounter, as if the airsed of a flamin will with a rady groun, and shine of the landscape, turned for the worse to the s'

In [37]:
### Model after epoch 30
model_epoch30 = load_model('my_model.h5')
model_epoch30.load_weights('checkpoint_500_epoch_30.hdf5')
print("Text Generation from Model after 30th epoch: ")
generate_text(model_epoch30, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)

Text Generation from Model after 30th epoch: 
9”

“Since it is my husband?” said Madame Defarge. “I have the honour of counself and superiors when he saw the passage ladyed on a short notice, for you are provided on business who says it is (a moment, and where a quantic straight continuty of the coach arrs. A feir of his voice and
he precious at drank and dancer passed for a man who read house, and his hear was closed out of the street, and not one with barber, and write on the scafful body as if there had been taken estated at him, and alw

'9”\n\n“Since it is my husband?” said Madame Defarge. “I have the honour of counself and superiors when he saw the passage ladyed on a short notice, for you are provided on business who says it is (a moment, and where a quantic straight continuty of the coach arrs. A feir of his voice and\nhe precious at drank and dancer passed for a man who read house, and his hear was closed out of the street, and not one with barber, and write on the scafful body as if there had been taken estated at him, and alwa'

In [38]:
### Model after epoch 40
model_epoch40 = load_model('my_model.h5')
model_epoch40.load_weights('checkpoint_500_epoch_40.hdf5')
print("Text Generation from Model after 40th epoch: ")
generate_text(model_epoch40, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)

Text Generation from Model after 40th epoch: 
You want
a promise from me. A man what I had done, eyebrous for me; my nephew,” said Defarge, with a watch of time when the postilions had come over her hand. An
horring in the further changed torn would only hear the extraorde in its coach and grass, had been heavy count of viving access to or, as if
his dame been to any of the two touch some days when it was necessary to be done, and holding the penderingement were so confused. As his lips of the triumphs of the face was desired in to
work upo

'You want\na promise from me. A man what I had done, eyebrous for me; my nephew,” said Defarge, with a watch of time when the postilions had come over her hand. An\nhorring in the further changed torn would only hear the extraorde in its coach and grass, had been heavy count of viving access to or, as if\nhis dame been to any of the two touch some days when it was necessary to be done, and holding the penderingement were so confused. As his lips of the triumphs of the face was desired in to\nwork upon'

## Section 7: Observations and analysis

Answer the following questions:
1. What do you conclude from the metrics?

2. If the metrics are not good, try to find out what is the reason in order to improve the model. What kind of inputs does the model not do well? (i.e. what are the blind spots or invalid assumptions?). Note that to answer this question, you need to decide what a "good" result is for your problem formulation.

3. What improvements do you propose?

### Observations

We can make several observations from the model's output.

* Early epochs generated text with repeated sequences. For example, the model after epoch 1 repeated "I will not be and the prisoner in the courtyard that I have been a little morning" several times. Later epochs had fewer such repetitions.

* Early epochs generated text with fewer spelling errors. This may be because the model had memorized a limited but well-learnt vocabulary from which it tends not to deviate.

* Line breaks are a challenge for the model to learn.

* The model can learn punctuation, particularly for direct speech. It learns to:
    * generate both opening and closing quotes
    * end a sentence within the quotes with a comma (instead of a full stop)
    * follow the direct speech with 'said' 
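
The observations about repetition and spelling could be turned into rough numbers. The sketch below is one possible way to do so and is not part of the notebook: both metrics, their names, and the word-splitting rule (lowercase runs of the letters a to z) are assumptions chosen for simplicity.

```python
import re
from collections import Counter


def words(text):
    """Split text into lowercase alphabetic words (a deliberate simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def out_of_vocabulary_rate(generated, corpus):
    """Fraction of generated words that never appear in the training corpus."""
    known = set(words(corpus))
    gen = words(generated)
    if not gen:
        return 0.0
    return sum(w not in known for w in gen) / len(gen)


def repeated_ngram_rate(generated, n=5):
    """Fraction of word n-grams in the generated text that occur more than once."""
    gen = words(generated)
    ngrams = [tuple(gen[i:i + n]) for i in range(len(gen) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return sum(1 for g in ngrams if counts[g] > 1) / len(ngrams)


# Made-up strings standing in for the training text and a generated sample.
corpus = "the prisoner was in the courtyard"
generated = "the prisoner was equap in the courtyard"
oov = out_of_vocabulary_rate(generated, corpus)
looping = repeated_ngram_rate("a b c a b c a b c", n=3)
varied = repeated_ngram_rate("a b c d e", n=3)
print(oov, looping, varied)
```

Applied to the samples from each checkpoint, a falling `repeated_ngram_rate` and a rising `out_of_vocabulary_rate` would be consistent with the two trends described above.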

Areas of improvement include:

* Use a machine with a GPU, as training takes a long time.
* Use more hidden layers, which may improve the results.
* Train for more epochs. The references mention that researchers have used up to 200 epochs to train good models.

Areas to explore include:

* Experiment to see whether the generated text produces the same word2vec structure as the original text. This would help determine whether RNNs can learn similar 'word associations' if trained long enough.
