# In Memoriam
## AI Module IS53024B Coursework II.

The Aim. What are we trying to build?

Building on knowledge from Chapter 8 in the Deep Learning with Phyton book by Francois Chollet we will be generating text using a text file (.txt) containing Animal Farm as our source of reference. In order to build a language model we need lots of text data which is why I decided to use the book Animal Farm by English Novelist and Essayist George Orwell (also very well known for his 1984 book). 

Initially I had opted to use J.R.R. Tolkien Lord of the Rings books, however I found it difficult to distinguish in my results between Tolkein's use of old english and abstract newly created words by the language model under a high temperature. 

Given our decision to use Orwell's Animal Farm, this language model will learn patterns and a writing style specific and inspired by George Orwell. His choice of topics which include but are not limited to social injustice, opposition to totalitarianism and an explicit endorsement of democratic socialism will also come into play.

The architecture we will be using to build this language model is bidirectional long-short-term-memory (L.S.T.M.) although this project could also be performed on 1D Convnet architecture or a stack LSTM just like the example in the Deep Learning with Python book. 

What is an long-short-term-memory? (Notes taken down from Deep Learning with Phyton for the exam)

Developed by Hochreiter and Schmidhuber in 1997 long short term memory was the culminatorial product of their research on the vanishing gradient problem. In a nutshell the LSTM saves information for later, thus preventing older signals from gradually vanishing during processing. 

It is meant to to allow past information to be reinjected at a later time. This is how it fights the vanishing-gradient problem and by result fighting overfitting.

In [45]:
import numpy as np 

#import spacy, and spacy English(en)model
# spacy is used to work on text
import spacy
nlp = spacy.load('en') # nlp stands for natural language processing

# importing necessary libraries to use throughout

import tensorflow
import codecs
import collections
import os
import random
import sys
import os
import time
import h5py
from six.moves import cPickle

#importing layers,models and optimizers along with callbacks from Keras

from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint 



# loading .txt file 
path = 'nopunc2.txt'
text = open(path).read().lower() # reading text
print('Corpus length:', len(text)) #printing length of text

#print(text)

Corpus length: 962546


An understanding of the data at hand and a good preparation of data is a staple of any deep learing investigation. It is in fact the most important part. In the cells below I make use of the Spacy library to retrieve the words from Animal Farm using the library's tokenizer. I make an effort to minimise the number of potential words in my dictionary discarding of capital notation. Capital notation being only a matter of syntax makes it irrelevant to the task at hand as it does not deal with logic or sense behind word structure. 

In [46]:
# word list creating dunction which creates a list of words available in the .txt file
def create_wordlist(doc):
    wl = []
    for word in doc:
        if word.text not in ("\n","\n\n",'\u2009','\xa0','n’t',',','.','\n  \n\n  \n',' \n\n','“','\n\n  \n',':','-'):
            wl.append(word.text.lower())
    return wl

In [47]:
wordlist = []

input_file = path
#read data
with codecs.open(input_file, "r") as f:
 data = f.read()
        
#create sentences
doc = nlp(data) # nlp stands for natural language processing
wl = create_wordlist(doc)
wordlist = wordlist + wl

The whole of Animal Farm is now transformed into a single list of words, which means that I can now put together a dictionary of words which appear in the book, excluding any duplicates. In the cell below we will assign an index to each word. 

In [48]:
print(wordlist)

word_counts = collections.Counter(wordlist)
# print(word_counts.most_common())

# we can see that'the' appears 2219 time in the book Animal Farm, in comparison other words appear only once, these include 'spades' and 'dispelled'

# Mapping from index to word : that's the vocabulary
vocabulary_inv = [x[0] for x in word_counts.most_common()]
vocabulary_inv = list(sorted(vocabulary_inv))

# Mapping from word to index
vocab = {x: i for i, x in enumerate(vocabulary_inv)}
words = [x[0] for x in word_counts.most_common()] # starts at the top from most common to least common at the bottom

#size of the vocabulary
vocab_size = len(words)
print("vocab size: ", vocab_size)



['\ufeff', 'former', 'opposition', 'leader', 'simon', 'busuttil', 'testified', 'in', 'court', 'this', 'morning', 'as', 'did', 'the', 'prime', 'ministeräôs', 'more', '\n \n\n', 'former', 'opposition', 'leader', 'simon', 'busuttil', 'testified', 'in', 'court', 'this', 'morning', 'as', 'did', 'the', 'prime', 'ministeräôs', 'more', 'whenever', 'the', 'leader', 'of', 'the', 'labour', 'party', 'is', 'asked', 'questions', 'about', 'the', 'more', 'embarrassing', 'aspects', 'of', 'his', 'past', 'he', 'says', 'heäôll', 'leave', 'it', 'to', 'the', 'historians', 'to', 'decide', 'because', 'as', 'far', 'as', 'heäôs', 'concerned', 'itäôs', 'all', 'water', 'under', 'the', 'bridge', 'and', 'he', 'has', 'no', 'regrets', 'and', 'this', 'when', 'nobody', 'has', 'bothered', 'to', 'ask', 'him', 'yet', 'what', 'he', 'thought', 'of', 'the', 'labour', 'governmentäôs', 'corruption', 'and', 'terrible', 'moral', 'and', 'physical', 'violence', 'in', 'the', 'days', 'when', 'he', 'was', 'president', 'of', 'the', 'l

In order to create the training data for our LSTM, we will opt to create two lists. Firstly, a sentences list which is meant to contain our sequences of words and secondly a list which will contain the next words to come after each of the sequence found in the sentences list. 

The way in which this will work, is that we take the 30th first words in the wordlist (dictionary) and the word with index 31 will be the the next word of this sequence and is thus added into the next words list as index 0. Jumping by a step of one we then continue this process and iterate till the last word in the word list. 

Going through this process we manage to yield 34949 sequences and therefore the same amount of next words for every sequence has also been predicted. 

In [49]:
# Length of extracted word sequences
maxlen = 30

# We sample a new sequence every `step` words
step = 1

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up words)
next_words = []

for i in range(0, len(wordlist) - maxlen, step):
    sentences.append(wordlist[i: i + maxlen])
    next_words.append(wordlist[i + maxlen])
    
print('Number of sequences:', len(sentences))

# testing out our code to make sure we are getting the results we expect
print(sentences[5000])
print(next_words[2])

Number of sequences: 171740
['first', 'time', 'dr', 'santäôs', 'philosophy', 'is', 'spelt', 'out', 'in', 'a', 'racy', 'style', 'by', 'a', 'seasoned', 'writer', 'who', 'has', 'known', 'the', 'writerpolitician', 'since', 'their', 'youth', 'and', 'is', 'unsparing', 'in', 'supplying', 'previously']
ministeräôs


Although we are getting somewhere we cannot expect an LSTM to digest a list of strings. Therefore we need to transform these lists into data which is easily digestable by the network we are proposing. One way to do this is to one-hot encode the lists into binary arrays. In doing so sequences of words will be reorganised into matrices made of boolean values in which true (1) values represent the index position of the word in the vocabulary. 

In [50]:
# Next, one-hot encode the words into binary arrays.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, vocab_size), dtype=np.bool)
y = np.zeros((len(sentences), vocab_size), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, word in enumerate(sentence):
        x[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1

Vectorization...


The model we build here is a bidirectional LSTM recurrent neural network followed by a drop out layer. We are expecting the network to provide us with a probability for each available word in the vocabulary to the next one after a given sentence. Therefore in regards the architecture of the model we opt to end with a dense layer the size of the vocabulary along with a softmax activation. With regards to callbacks we implement an Early Stopping function which feeds back every two epochs and stops training as soon as the validation loss stops improving after four epochs as stipulated by the patience parameters.

In [51]:
from tensorflow.keras.layers import LSTM, Input, Bidirectional
from tensorflow.keras.metrics import categorical_accuracy

# using a function to build model architecture in case we may use it again 

def bidir_LSTM(maxlen, vocab_size):
    model = models.Sequential()
    model.add(Bidirectional(LSTM(rnn_size, activation="relu"),input_shape=(maxlen, vocab_size)))
    model.add(layers.Dropout(0.2)) # dropout layer 
    model.add(layers.Dense(vocab_size))
    model.add(layers.Activation('softmax')) 
    
    optimizer = tensorflow.keras.optimizers.RMSprop(lr=learning_rate)
    callbacks=[EarlyStopping(patience=3, monitor='val_loss')]
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[categorical_accuracy])
    print("model built!")
    return model

In [54]:
rnn_size = 128 # size of RNN
learning_rate = 0.001 #learning rate

Model2_bidir = bidir_LSTM(maxlen, vocab_size)
Model2_bidir.summary()

model built!
Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_6 (Bidirection (None, 256)               4133888   
_________________________________________________________________
dropout_6 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 3908)              1004356   
_________________________________________________________________
activation_6 (Activation)    (None, 3908)              0         
Total params: 5,138,244
Trainable params: 5,138,244
Non-trainable params: 0
_________________________________________________________________


In [2]:
batch_size = 80 # minibatch size
num_epochs = 150 # number of epochs

callbacks=[EarlyStopping(patience=3, monitor='val_loss'),
          ModelCheckpoint(filepath='my_model_gen_sentences.{epoch:02d}-{val_loss:.2f}.hdf5',\
                           monitor='val_loss', verbose=0, mode='auto', period=2)]
#fit the model

history = Model_bidir.fit(x, y,
                 batch_size=batch_size,
                 shuffle=False,
                 epochs=num_epochs,
                 callbacks=callbacks,
                 validation_split=0.3)

#save the model
#Model_bidir.save('my_model_generate_sentences.h5')  
Model_bidir.save('my_model_generate_sentences.h5') 

NameError: name 'EarlyStopping' is not defined

It seems as though in training, the loss is stuck between 6.5 and 6.7 and does not seem to be improving much. It stalls in such a way that the Early Stopping callback decides to put an end to training at the 5th epoch. From what I am thinking this stall in loss improvement may be attributed to the learning rate which may be too high a value and also to the size of the RNN which may need to be slightly bigger to improve representational value. Never the less I will still try to generate text using this training. Taking on board points from previous courseworks, I have factored in the above results into tables to make for easier analysis.

In [22]:
from tabulate import tabulate
print(tabulate([['Epoch 1 0.0682', 420.3232], ['Epoch 2 0.0837 ', 6.5216],['Epoch 3 0.0936', 6.7546], ['Epoch 4 0.1027', 6.7792], ['Epoch 5 0.1071 ',6.7629 ]], headers=['Training_Acc','Training_Loss']))

Training_Acc      Training_Loss
--------------  ---------------
Epoch 1 0.0682         420.323
Epoch 2 0.0837           6.5216
Epoch 3 0.0936           6.7546
Epoch 4 0.1027           6.7792
Epoch 5 0.1071           6.7629


In [23]:
from tabulate import tabulate
print(tabulate([['Epoch 1 0.0764',  6.2176], ['Epoch 2 0.0987 ',6.5826],['Epoch 3 0.1070', 6.6286], ['Epoch 4 0.1302', 6.6122], ['Epoch 5 0.1356', 6.7537]], headers=['Val_Acc','Val_Loss']))

Val_Acc           Val_Loss
--------------  ----------
Epoch 1 0.0764      6.2176
Epoch 2 0.0987      6.5826
Epoch 3 0.1070      6.6286
Epoch 4 0.1302      6.6122
Epoch 5 0.1356      6.7537


In [144]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

Here, the probability for a word to be drawn still depends directly on the probability of it being the next word due to our bidirectional LSTM Model. To tune this probability, we opt to introduce a “temperature” parameter to smooth or sharpen its value according to what we would like to see output in terms of word and paragraph structure. 

In [24]:
import random
import sys


words_number = 20 # number of words to generate
seed_sentences = 'instead he backed off further and further and at one point literally turned and ran away into a separate section of the television studio so that he was filmed standing ridiculously under the promotional backdrop for a distributor of cooking implements and other household goods.'
generated = ''
sentence = []

for i in range (maxlen):
    sentence.append(".")

seed = seed_sentences.split()

for i in range(len(seed)):
    sentence[maxlen-i-1]=seed[len(seed)-i-1]

generated += ' '.join(sentence)

for temperature in [0.2, 0.3, 1.0, 1.2]:
    print('------ temperature:', temperature)
    #we generate the text
    for i in range(words_number):
        #create the vector
        x = np.zeros((1, maxlen, vocab_size))
        for t, word in enumerate(sentence):
            x[0, t, vocab[word]] = 1.


        #calculate next word
        preds = Model_bidir.predict(x, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_word = vocabulary_inv[next_index]

        #add the next word to the text
        generated += " " + next_word
        # shift the sentence by one, and and the next word at its end
        sentence = sentence[1:] + [next_word]
    
    print(generated)

------ temperature: 0.2


NameError: name 'sample' is not defined

# The Standard LSTM DLWP Example using Animal Farm by George Orwell

Initially I decided to try and feed the Animal Farm text into the standard Deep Learning with Python Francois Chollet example in order to be able to compare and contrast with the results which are to be retrieved when we train the Biderectional LSTM model on the same text. Below are documented the generated results of the standard example, showing the first epoch and the final epoch at varying degrees of temperature. From these results we can observe a great improvement in the loss from 2.2382 to 0.8436. 

It is also very clear how varying degrees of tempreture affect the output generated, low tempretures tend to provide for very repetitive (as evident from the multiple and's in the first epoch at temp 0.2) and predictable text along with a realistic structure to the writing, the pattern of characters chosen therefore generate words which exists in the English languge. On the other hand higher tempretures provide for more flexible character generation, often ending up with new words which seem plausible making for an interesting read. A few examples of newly generated words from epoch 59 at temp 1.2 are "pilutions" and "implarn". At a higher temperature in the case of character level generation local structure starts to disintegrate and words often seem like a bunch of random character strings. 

It must be noted that this example involved character-level text generation where we were deciding what the next character should be. In the bidirectional example I decide to switch this to word-level text generation. 

epoch 1 Train on 56429 samples 56429/56429 [==============================] - 70s 1ms/sample - loss: 2.2382

--- Generating with seed: "rd in the drawing-room. it was also armounced that the gun w"

------ temperature: 0.2

rd in the drawing-room. it was also armounced that the gun whe he was ster and were when the windmall and all and and and and and and and and and and and and and and and and and and and and the conss and and who and and and and and and the animals sure and and and the for the pigs the sting and whe windmall snow and and and and his seat of the for the windmall stor and and and were the seat the animals stor and and and be the windmell some and when the ani

------ temperature: 1.2

barabie, und men to the your dick ever and when then in but wariwnes sfout the pnitahgid, chle pirise ingly horded hork. in whil fos, whout sanf upy sore seal. but uxplanimals ands in siged the pigs on nit misse, af theife junainwapseld wos furven ofhibt upbed and the farm us. as eysyuons forsh hemware, re almible in whoul wag no comp is of mrike of tt aimadich af. whan engy a" imminsmadsurims, hed animels dennlsher, bnglkion. themet of dyap,s. "nrund tqee

epoch 59 Train on 56429 samples 56429/56429 [==============================] - 60s 1ms/sample - loss: 0.8436

--- Generating with seed: "his sudden uprising of creatures whom they were used to thra"

------ temperature: 0.2

his sudden uprising of creatures whom they were used to thrase which had been discoly. and sometimes said, the animals were appearing the four plang as though they were the commonous dogs and the pigs are discested to be work the salaces of the farm as somethie sains the animals were arrount to the other animals were arroling a terr will shals the pigs arough the same the windmill. he was also decoration of the farm a dicky, whee the stall and were all to

------ temperature: 1.2

le now "and in seem a will implarn jones and ariom to them! thies were ebel for the paitcus for that benjamin was on the animals frightened a prens ?our. they raund, themselves, so not enouls cick and fredicite thein ew animals "sauning back. all again, to seed food lags by the imby the animals work in thing broke querry. ducky, comradespr-ess turnes, a guis of instance-puddless. is seemed to pilutions. as took the since botters no one rapoleing, but but f

As directed by the coursework requirements apart from switching from character level text generation to word-level text generation I also decided to make use of a bidirectional LSTM, use callbacks EarlyStopping and ModuleCheckpoint and to also include a dropout layer in the model architecture to mitigate overfitting. These are all elements which were not included in the DLWP standard LSTM text example but these concepts were all discussed in Part II of DLWP.

# Conclusion

We finish off with a neural network and some scripts capable of generating text in line with Orwell's writing style although not quite! The raw result of the neural networks trained during both the standard DLWP example and the Bidirectional LSTM are not without any flaws however they are something...As alluded to previously increasing the size of the RNN and allowing longer training would most probably yield better results, as well as tuning the model so to limit a great amount of variance. Although indeed text is generated, often times it is difficult to find global meaning in it and to make sense of the paragraph as a whole which brings us to conclude that current deep learning is far behind something which is readable or which can be on-par with the writing of an author. The generated text seems to lack a narative unlike what an author would come to produce. I believe this is one of the main points why text generation fails to achieve global sense. 

How can we improve on this ?

Perhaps the way to do this is to keep trying to improve the generation of sentences. This can be done by detecting patterns in the sequences of sentences in the whole of the novel and not simply looking for patterns in the sequences of words. This could come to reveal the context of a paragraph and would be able to be used to wisely select and formulate the  structure of the next sentence of the text.