# Text Generation

## Deep Learning

For this model, we can choose between GRU and LSTM units. GRUs are simpler than LSTMs and train faster. However, since this project works with longer sequences and requires modeling longer-range relationships, an LSTM unit was chosen. This was a fun project inspired by Understanding LSTM Networks (Colah's blog) and The Unreasonable Effectiveness of Recurrent Neural Networks (Andrej Karpathy's blog).
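To put a rough number on "simpler", the sketch below counts the weights of a single recurrent layer under the textbook formulations: four gate blocks for an LSTM and three for a GRU. It assumes a one-hot input of 71 characters and 256 hidden units, the values that appear later in this notebook. `recurrent_params` is a hypothetical helper written for this illustration, and library implementations may add extra bias terms.

```python
def recurrent_params(input_dim, hidden_dim, n_gate_blocks):
    """Weights per gate block: input kernel + recurrent kernel + bias."""
    per_block = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return n_gate_blocks * per_block

# Assumed sizes: 71 one-hot input features, 256 hidden units
lstm_params = recurrent_params(71, 256, n_gate_blocks=4)
gru_params = recurrent_params(71, 256, n_gate_blocks=3)
print(lstm_params, gru_params)  # 335872 251904
```

Under these assumptions a GRU layer carries three quarters of the weights of an LSTM layer of the same width, which is where most of the training-speed difference comes from.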

---

## Step 0: Import required packages

In [1]:
## LIST OF ALL IMPORTS
import os
import math
import random
import time
import os.path as path
from datetime import datetime

import numpy as np
import tensorflow as tf
import keras

from keras.models import load_model, Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, TimeDistributed, SimpleRNN, Activation
from keras.optimizers import RMSprop, Adam

from utils import load_corpus

Using TensorFlow backend.


---

## Step 1: Data Check

While this procedure can be carried out with any dataset (e.g. books or plays), as well as shorter texts such as poems, short stories, or a corpus of tweets, I wanted to test it on the Harry Potter series (out of personal preference). For exploration, we extract overlapping sequences and count the number of sequences and unique characters present.

In [2]:
data_path='dataset/raw_hp.txt'
# Read the whole corpus and lowercase it; the context manager closes the file
with open(data_path) as f:
    text_sample=f.read().lower()
print('Corpus of Harry Potter texts has a length of {}...muggle.'.format(len(text_sample)))

length_sequence=100
sample_step=3
extracted_sequences=[]

for i in range(0,len(text_sample)-length_sequence,sample_step):
    extracted_sequences.append(text_sample[i:i+length_sequence])
# List of unique characters in the corpus
unique_characters=sorted(set(text_sample))
print('Number of sequences:', len(extracted_sequences))
print('Unique characters:', len(unique_characters))

Corpus of Harry Potter texts has a length of 6272545...muggle.
Number of sequences: 2090829
Unique characters: 71
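The number of sequences follows directly from the `range` call in the loop, so it can be checked without building the list. The sketch below uses a hypothetical helper, `count_windows`, and the corpus length printed above. One thing it surfaces: the reported count of 2090829 matches a window of 60 characters rather than 100, which suggests the output shown here came from a run with `length_sequence=60`.

```python
def count_windows(n_chars, length, step):
    """Number of start offsets visited by range(0, n_chars - length, step)."""
    return len(range(0, n_chars - length, step))

corpus_length = 6272545  # corpus length printed above

print(count_windows(corpus_length, 100, 3))  # 2090815
print(count_windows(corpus_length, 60, 3))   # 2090829
```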


In [3]:
print('Sample Text: ',text_sample[0:124])
print('Extracted sequence every 3 characters: ',extracted_sequences[0:10]) # Every three characters

Sample Text:  mr. and mrs. dursley, of number four, privet drive, were proud to say that they were perfectly normal, thank you very much. 
Extracted sequence every 3 characters:  ['mr. and mrs. dursley, of number four, privet drive, were pro', ' and mrs. dursley, of number four, privet drive, were proud ', 'd mrs. dursley, of number four, privet drive, were proud to ', 'rs. dursley, of number four, privet drive, were proud to say', ' dursley, of number four, privet drive, were proud to say th', 'rsley, of number four, privet drive, were proud to say that ', 'ey, of number four, privet drive, were proud to say that the', ' of number four, privet drive, were proud to say that they w', ' number four, privet drive, were proud to say that they were', 'mber four, privet drive, were proud to say that they were pe']


In [4]:
print("Loading corpus of texts.")
X,Y,len_vocabulary,int2char=load_corpus(data_path,length_sequence)

Loading corpus of texts.


MemoryError: 
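The code of `load_corpus` is not shown here, so the cause of the `MemoryError` is a guess. If it builds one dense one-hot array over every window, the array gets large quickly: 2,090,815 windows of 100 characters over 71 classes is about 14.8 GB even at one byte per entry. One common way around this is to encode one batch at a time. The sketch below is a minimal version of that idea, not the author's loader. `one_hot_batches` and `char2int` are hypothetical names, and it assumes the target is the input shifted one character ahead.

```python
import numpy as np

def one_hot_batches(text, char2int, length, step, batch_size):
    """Yield (X, Y) one-hot batches; Y is X shifted one character ahead."""
    vocab = len(char2int)
    starts = range(0, len(text) - length, step)
    for b in range(0, len(starts), batch_size):
        batch_starts = starts[b:b + batch_size]
        # Only one batch is held in memory at a time
        X = np.zeros((len(batch_starts), length, vocab), dtype=np.bool_)
        Y = np.zeros_like(X)
        for row, s in enumerate(batch_starts):
            for t in range(length):
                X[row, t, char2int[text[s + t]]] = True
                Y[row, t, char2int[text[s + t + 1]]] = True
        yield X, Y

# Tiny demonstration on a toy string
demo_text = "hello world"
demo_map = {c: i for i, c in enumerate(sorted(set(demo_text)))}
for X_batch, Y_batch in one_hot_batches(demo_text, demo_map, 4, 2, 2):
    print(X_batch.shape, Y_batch.shape)
```

Keras can train from a generator like this, but it expects one that loops indefinitely, so this sketch would need to be wrapped in a `while True:` loop before being used for training.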

---

## Step 2: Network creation

In [None]:
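def text_generation(model,text_length,vocabulary,id2char):
    # One-hot buffer for the sequence generated so far
    X=np.zeros((1,text_length,vocabulary))
    # Start from a random character id; keep ids in a list so [-1] is the latest one
    id_start=[np.random.randint(vocabulary)]
    y_pred=[id2char[id_start[-1]]]

    for i in range(text_length):
        # Add the latest character to the sequence
        X[0,i,id_start[-1]]=1.
        # Predict on the first i+1 characters and take the most likely id at each step
        id_start=np.argmax(model.predict(X[:,:i+1,:])[0],1) # Max likelihood
        y_pred.append(id2char[id_start[-1]])
    # Join without a separator so the output reads as continuous text
    return ''.join(y_pred)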
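The function above always picks the most likely character. Greedy decoding like this tends to fall into repeated phrases, and Karpathy's post discusses sampling with a temperature as an alternative. The sketch below is one way to do that. It is not part of the original notebook, and `sample_with_temperature` is a hypothetical helper that takes one timestep of softmax output.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from probs after rescaling by temperature."""
    if rng is None:
        rng = np.random.default_rng()
    # Work in log space; the small constant avoids log(0)
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# A low temperature sharpens the distribution toward the most likely index
print(sample_with_temperature([0.1, 0.7, 0.2], temperature=0.01))
```

A temperature near 0 behaves like `argmax`, while values around 1 sample from the distribution the model predicts.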

In [None]:
# Assumes X is one-hot encoded with shape (samples,length_sequence,len_vocabulary),
# matching the one-hot input built in text_generation
input_shape=(None,len_vocabulary)
num_layers=2
rnn_hidden_layers=256
n_epochs=60
batch_size=128

model=Sequential()
# First LSTM layer declares the input shape; later layers infer it
model.add(LSTM(rnn_hidden_layers,input_shape=input_shape,return_sequences=True))
for i in range(num_layers-1):
    model.add(LSTM(rnn_hidden_layers,return_sequences=True))
# Per-timestep softmax over the vocabulary is the output layer
model.add(TimeDistributed(Dense(len_vocabulary,activation='softmax')))
optimizer=RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy',optimizer=optimizer)

model.summary()

---

## Step 3: Text Generation and Training

In [None]:
save_path='generated/'
os.makedirs(save_path,exist_ok=True)

print("Generation before training.")
# Baseline sample from the untrained network
prev=text_generation(model,500,len_vocabulary,int2char)
print(prev)

print("Training.")

for epoch in range(n_epochs):
    print("Training epoch ",epoch)
    model.fit(X,Y,batch_size=batch_size,verbose=1,epochs=1)

    # Every 5 epochs, checkpoint the weights and write a sample to disk
    if epoch%5==0:
        generated_novel=text_generation(model,500,len_vocabulary,int2char)
        model.save_weights(path.join(save_path,'Weights_epoch{}.hdf5'.format(epoch)))
        file_name=path.join(save_path,'hp_generated_epoch_'+str(epoch)+'.txt')
        with open(file_name,'w') as f:
            f.write(generated_novel)
        print('Novel generated at Epoch {}:'.format(epoch))
        print(generated_novel)