**This project uses Agatha Christie's 'Poirot Investigates' to train a word-level language model that generates text from a few input words. Pre-trained GloVe word embeddings (the `glove.6B` vectors used in the Keras Newsgroup20 example) were used in an attempt to improve performance.**

**The code was written on Google Colab, so it may run more smoothly there, but it should work in other environments.**

# Prerequisites

**The raw text version of 'Poirot Investigates' by Agatha Christie is downloaded from Project Gutenberg.**

In [None]:
!pip install keras-tuner

Collecting keras-tuner
  Downloading https://files.pythonhosted.org/packages/20/ec/1ef246787174b1e2bb591c95f29d3c1310070cad877824f907faba3dade9/keras-tuner-1.0.2.tar.gz (62kB)
Collecting terminaltables
  Downloading https://files.pythonhosted.org/packages/9b/c4/4a21174f32f8a7e1104798c445dacdc1d4df86f2f26722767034e4de4bff/terminaltables-3.1.0.tar.gz
Collecting colorama
  Downloading https://files.pythonhosted.org/packages/44/98/5b86278fbbf250d239ae0ecb724f8572af1c91f4a11edf4d36a206189440/colorama-0.4.4-py2.py3-none-a

In [None]:
import numpy as np
import os
import string
from keras.preprocessing.text import text_to_word_sequence, Tokenizer
from keras.utils import to_categorical
from keras.models import Model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Input
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing.sequence import pad_sequences
import itertools
import re
from pickle import dump
from pickle import load

from kerastuner.tuners import RandomSearch
from kerastuner.tuners import Hyperband
from kerastuner.engine.hyperparameters import HyperParameters

#Setting random seeds for reproducibility
import random
random.seed(42)
np.random.seed(42)

In [None]:
!curl -O https://www.gutenberg.org/files/61262/61262-0.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  326k  100  326k    0     0   162k      0  0:00:02  0:00:02 --:--:--  162k


**If you only want to use a trained model rather than train a new one, the saved encoder can be loaded later (see the Testing section).**

**Unwanted text (publishing information etc.) is removed from the raw file and the result is saved as 'Poirot.txt'.**

In [None]:
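#deleting unwanted text at the front and back of the book
#(the slice indices are specific to this copy of the file)
with open('61262-0.txt', encoding='utf-8') as old, open('Poirot.txt', 'w', encoding='utf-8') as new:
    lines = old.readlines()
    new.writelines(lines[110:-374])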
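
**The slice `lines[110:-374]` only fits this exact copy of the file. As a rough alternative, the sketch below trims the licence text by looking for the `*** START OF` and `*** END OF` marker lines that Project Gutenberg plain-text files usually contain. The helper name `strip_gutenberg_boilerplate` and the sample lines are made up for illustration, and unlike the slice above it would leave the title page and contents in place.**

```python
def strip_gutenberg_boilerplate(lines):
    """Return only the lines between the Gutenberg START and END markers.

    Hypothetical helper: assumes marker lines beginning with '*** START OF'
    and '*** END OF'. If a marker is missing, that end is left untrimmed.
    """
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith('*** START OF'):
            start = i + 1   # keep everything after the START marker
        elif line.startswith('*** END OF'):
            end = i         # stop just before the END marker
            break
    return lines[start:end]


# made-up sample standing in for the downloaded file
sample = [
    'The Project Gutenberg eBook of Poirot Investigates\n',
    '*** START OF THE PROJECT GUTENBERG EBOOK POIROT INVESTIGATES ***\n',
    'I was standing at the window of Poirot’s rooms\n',
    '*** END OF THE PROJECT GUTENBERG EBOOK POIROT INVESTIGATES ***\n',
    'License text\n',
]
body = strip_gutenberg_boilerplate(sample)
print(body)
```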

In [None]:
#opening the text file and reading it into memory as `Poirot` (the file is closed automatically)
with open('Poirot.txt', 'r', encoding='utf-8') as text:
    Poirot = text.read()

#printing first 1000 characters to double check
print(Poirot[:1000])

  The Adventure of “The Western Star”

I was standing at the window of Poirot’s rooms looking out idly on
the street below.

“That’s queer,” I ejaculated suddenly beneath my breath.

“What is, _mon ami_?” asked Poirot placidly, from the depths of
his comfortable chair.

“Deduce, Poirot, from the following facts! Here is a young lady,
richly dressed—fashionable hat, magnificent furs. She is coming
along slowly, looking up at the houses as she goes. Unknown to her,
she is being shadowed by three men and a middle-aged woman. They
have just been joined by an errand boy who points after the girl,
gesticulating as he does so. What drama is this being played? Is
the girl a crook, and are the shadowers detectives preparing to
arrest her? Or are _they_ the scoundrels, and are they plotting to
attack an innocent victim? What does the great detective say?”

“The great detective, _mon ami_, chooses, as ever, the simplest
course. He rises to see for himself.” And my friend joined me at
the window.


In [None]:
# using keras' `text_to_word_sequence` function to tokenise the text
# curly quotes and bullets are added to the default filters so that typographic punctuation is stripped too
cleaned_tokens = text_to_word_sequence(Poirot, filters='“”’‘•!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

print(f'There are {len(cleaned_tokens)} words in the cleaned version of Poirot Investigates ({len(set(cleaned_tokens))} unique words).')
print('------------------------------------------------------------------------------------------------------')
print(f'Sample of 10 tokens: {cleaned_tokens[:10]}')

There are 53501 words in the cleaned version of Poirot Investigates (6167 unique words).
------------------------------------------------------------------------------------------------------
Sample of 10 tokens: ['the', 'adventure', 'of', 'the', 'western', 'star', 'i', 'was', 'standing', 'at']


In [None]:
#make sequences of 36 tokens
length = 36
sequences = list()
for i in range(length, len(cleaned_tokens)):
	# select sequence of tokens
	seq = cleaned_tokens[i-length:i]
	# convert into a line
	line = ' '.join(seq)
	# store
	sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 53465
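
**To make the windowing concrete, here is a toy version of the loop above on six tokens with a window of four. The names `make_windows` and `toy_tokens` are illustrative only. With this loop bound the number of windows is `len(tokens) - length`, which matches 53501 - 36 = 53465 above; the last token of each window later becomes the target and the rest become the input.**

```python
def make_windows(tokens, length):
    """Slide a window of `length` tokens over the text, one token at a time."""
    windows = []
    for i in range(length, len(tokens)):
        windows.append(tokens[i - length:i])
    return windows


toy_tokens = ['i', 'was', 'standing', 'at', 'the', 'window']
toy_windows = make_windows(toy_tokens, length=4)

for window in toy_windows:
    # the first length-1 tokens are the input, the final token is the target
    context, target = window[:-1], window[-1]
    print(context, '->', target)
```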


# Encoding

In [None]:
# creating a Tokenizer instance to encode word sequences as integers
encoder = Tokenizer()
encoder.fit_on_texts(sequences)
encoded_lines = encoder.texts_to_sequences(sequences)
print('--first line of text shown below in encoded form--')
print(encoded_lines[0])
#saving encoder for use at time of generation
with open('./encoder.pkl', 'wb') as f:
    dump(encoder, f)

--first line of text shown below in encoded form--
[1, 1566, 5, 1, 1116, 407, 3, 10, 661, 23, 1, 279, 5, 13, 17, 830, 311, 52, 3022, 25, 1, 599, 1284, 12, 17, 1283, 3, 1114, 260, 1113, 21, 731, 38, 14, 120, 140]


In [None]:
word_index = encoder.word_index
print('printing 10 values from the `word_index` dictionary')
print('---------------------------------------------------')
dict(itertools.islice(word_index.items(), 10))

printing 10 values from the `word_index` dictionary
---------------------------------------------------


{'a': 4,
 'and': 6,
 'he': 8,
 'i': 3,
 'in': 7,
 'it': 9,
 'of': 5,
 'the': 1,
 'to': 2,
 'was': 10}
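
**The indices above follow frequency rank: the most common word gets index 1, and index 0 is left free for padding. The plain-Python sketch below imitates that ranking on a toy sentence and builds the reverse index-to-word mapping that is needed when decoding predictions. `build_word_index` is a hypothetical helper, not part of Keras, and tie-breaking between equally frequent words may differ from the real `Tokenizer`.**

```python
from collections import Counter


def build_word_index(tokens):
    """Map each word to a 1-based rank by descending frequency (0 stays free for padding)."""
    counts = Counter(tokens)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


toy_tokens = ['the', 'hand', 'of', 'the', 'clock', 'of', 'the', 'hall']
toy_word_index = build_word_index(toy_tokens)

# reverse mapping, used to turn a predicted index back into a word
toy_index_word = {index: word for word, index in toy_word_index.items()}

print(toy_word_index)
print(toy_index_word[1])
```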

In [None]:
# defining number of words in vocabulary (+1 because index 0 is reserved for padding)
word_dim = len(set(cleaned_tokens))+1
# making encoded lines into numpy array so that model can process
encoded_lines = np.array(encoded_lines)

# defining predictor and response variables full dataset for full training
X, y = encoded_lines[:,:-1], encoded_lines[:,-1]

#splitting to train and validate for tuning
X_train, y_train = encoded_lines[:45000,:-1], encoded_lines[:45000,-1]
X_val, y_val = encoded_lines[45000:,:-1], encoded_lines[45000:,-1]

# one hot encoding response variable to fit with softmax function dense layer
y = to_categorical(y, num_classes=word_dim)
y_train = to_categorical(y_train, num_classes=word_dim)
y_val = to_categorical(y_val, num_classes=word_dim)

# saving the sequence length for later use
sequence_len = X.shape[1]
sequence_len_train = X_train.shape[1]
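
**One-hot encoding every target is memory-hungry at this vocabulary size. The sketch below estimates the size of the full one-hot target array, assuming 4-byte float32 values, and shows on a toy example that cross-entropy against a one-hot vector equals the negative log-probability of the target index. That equivalence is why Keras' `sparse_categorical_crossentropy` loss, which takes integer targets, is a possible alternative; it is not what this notebook uses. The function names are illustrative.**

```python
import math

# sizes taken from the outputs above; 4 bytes per value assumes float32 targets
n_sequences = 53465
vocab_size = 6167 + 1
one_hot_bytes = n_sequences * vocab_size * 4
print(f'one-hot targets: about {one_hot_bytes / 1e9:.2f} GB')


def cross_entropy_one_hot(one_hot, probs):
    """Categorical cross-entropy for one example with a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def cross_entropy_index(target_index, probs):
    """The same loss computed directly from the integer class index."""
    return -math.log(probs[target_index])


probs = [0.1, 0.7, 0.2]  # toy softmax output over a 3-word vocabulary
loss_one_hot = cross_entropy_one_hot([0, 1, 0], probs)
loss_index = cross_entropy_index(1, probs)
print(loss_one_hot, loss_index)
```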

# Tuning

In [None]:
#checking shape of training data
X.shape

(52773, 35)

In [None]:
# define tuner model

def build_model(hp):
  #sequential less flexible but using as it allows `predict_classes` later on
  model = Sequential()

  #adding embedding layer to project words into higher dimensional space
  model.add(Embedding(word_dim, hp.Int('n_dimensions', 50, 200, 50), input_length=sequence_len))
  
  # tuning the number of stacked LSTM layers that return full sequences;
  # every layer shares one `n_units` value because the hyperparameter name is reused
  for i in range(hp.Int('n_layers', 1, 3)):
    model.add(LSTM(hp.Int('n_units', 64, 256, 64), return_sequences=True))

  model.add(LSTM(hp.Int('n_units', 64, 256, 64)))

  model.add(Dense(hp.Int('n_units', 64, 256, 64), activation='relu'))

  #softmax dense layer to output probability distribution for each word
  model.add(Dense(word_dim, activation='softmax'))

  print(model.summary())

  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

  return model

In [None]:
# setting up tuner object and how long to tune for
seed = 1 # seed to improve reproducibility
executions_per_trial = 2
max_trials = 3
tuner = RandomSearch(build_model, max_trials = max_trials , objective = 'val_accuracy', seed=seed, executions_per_trial = executions_per_trial, directory = './LOG_DIR_Poirot')

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 35, 50)            319000    
_________________________________________________________________
lstm (LSTM)                  (None, 35, 64)            29440     
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 6380)              414700    
Total params: 800,324
Trainable params: 800,324
Non-trainable params: 0
_________________________________________________________________
None
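
**The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers, with the layer sizes read off the summary (a 6380-word vocabulary, 50-dimensional embeddings and 64 units). The helper names are illustrative.**

```python
def embedding_params(vocab_size, dim):
    # one vector of length `dim` per vocabulary entry
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units


counts = [
    embedding_params(6380, 50),  # embedding
    lstm_params(50, 64),         # lstm
    lstm_params(64, 64),         # lstm_1
    dense_params(64, 64),        # dense
    dense_params(64, 6380),      # dense_1
]
print(counts, sum(counts))
```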


In [None]:
tuner.search_space_summary()

Search space summary
Default search space size: 3
n_dimensions (Int)
{'default': None, 'conditions': [], 'min_value': 50, 'max_value': 200, 'step': 50, 'sampling': None}
n_layers (Int)
{'default': None, 'conditions': [], 'min_value': 1, 'max_value': 3, 'step': 1, 'sampling': None}
n_units (Int)
{'default': None, 'conditions': [], 'min_value': 64, 'max_value': 256, 'step': 64, 'sampling': None}
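
**To put `max_trials = 3` in context, the sketch below enumerates the grid implied by the three hyperparameters, assuming `max_value` is inclusive. It only counts combinations; it does not call Keras Tuner.**

```python
from itertools import product

# candidate values implied by min_value, max_value and step (max assumed inclusive)
n_dimensions = list(range(50, 200 + 1, 50))
n_layers = list(range(1, 3 + 1))
n_units = list(range(64, 256 + 1, 64))

search_space = list(product(n_dimensions, n_layers, n_units))
max_trials = 3

print(f'{len(search_space)} combinations in total')
print(f'{max_trials} trials cover {max_trials / len(search_space):.1%} of them')
```

**A random search with so few trials samples only a small part of the space, so the "best" model is best among the handful tried.**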


In [None]:
# executing search with validation data to evaluate
tuner.search(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_val,y_val))


Search: Running Trial #1

Hyperparameter    |Value             |Best Value So Far 
n_dimensions      |50                |?                 
n_layers          |3                 |?                 
n_units           |128               |?                 

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 35, 50)            319000    
_________________________________________________________________
lstm (LSTM)                  (None, 35, 128)           91648     
_________________________________________________________________
lstm_1 (LSTM)                (None, 35, 128)           131584    
_________________________________________________________________
lstm_2 (LSTM)                (None, 35, 128)           131584    
_________________________________________________________________
lstm_3 (LSTM)                (None, 128)               131584   

KeyboardInterrupt: ignored

**The best model found by Keras Tuner is retrieved below. The search selected a smaller model, with the fewest layers available in the search space and a mid-range value of 128 units per LSTM layer. Because the dataset is very small, the deeper, more complex models were perhaps overfitting quickly.**

In [None]:
#saving best model once tuning complete
best_model = tuner.get_best_models(num_models=1)[0]

# Main model

**After poor results from the model selected by tuning, it was decided to use pre-trained word embeddings. This would hopefully mitigate the fact that useful word embeddings may not have been learned from such a small training set.**

**The code block below is adapted from the Keras example on using pre-trained GloVe word embeddings (which demonstrates them on the Newsgroup20 dataset).**

https://keras.io/examples/nlp/pretrained_word_embeddings/

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip


path_to_glove_file = "./glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))


num_tokens = len(set(cleaned_tokens)) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

from keras.initializers import  Constant

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,
)

--2021-02-25 11:32:09--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-02-25 11:32:09--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-02-25 11:32:09--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0
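
**For readers who want to see what the embedding cell does without downloading GloVe, here is a plain-Python sketch on made-up three-dimensional vectors. Each GloVe line is a word followed by its coefficients. Words found in the file get their vector copied into the matrix row given by `word_index`, and words that are missing (and row 0, the padding index) stay all-zero. The vectors and the name `parse_glove_line` are invented for illustration.**

```python
def parse_glove_line(line):
    """Split one GloVe text line into (word, list of floats)."""
    word, coefs = line.split(maxsplit=1)
    return word, [float(x) for x in coefs.split()]


# made-up vectors standing in for glove.6B.100d.txt
toy_glove_lines = ['the 0.1 -0.2 0.3\n', 'poirot 0.5 0.6 -0.7\n']
toy_embeddings = dict(parse_glove_line(line) for line in toy_glove_lines)

toy_word_index = {'the': 1, 'poirot': 2, 'hastings': 3}
dim = 3

# one row per index, plus row 0 for padding; all rows start as zeros
matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
hits = misses = 0
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = vector
        hits += 1
    else:
        misses += 1  # row stays all-zero

print(f'Converted {hits} words ({misses} misses)')
print(matrix)
```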

**The chosen model is defined below**

In [None]:
input_layer = Input(shape=(sequence_len,))

#adding embedding layer to project words into higher dimensional space
embedded = embedding_layer(input_layer)

lstm1 = LSTM(128, return_sequences=True)(embedded)

lstm2 = LSTM(128, return_sequences=True)(lstm1)

lstm3 = LSTM(128)(lstm2)

dense = Dense(128, activation='relu')(lstm3)

#softmax dense layer to output probability distribution for each word
softmax = Dense(word_dim, activation='softmax')(dense)

model = Model(inputs=input_layer, outputs=softmax)

print(model.summary())

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 35)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 35, 100)           638100    
_________________________________________________________________
lstm_3 (LSTM)                (None, 35, 128)           117248    
_________________________________________________________________
lstm_4 (LSTM)                (None, 35, 128)           131584    
_________________________________________________________________
lstm_5 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_2 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_3 (Dense)              (None, 6168)              7956

# Fitting

**The model is now trained on the full dataset for 200 epochs. Judged against the validation set used earlier, 200 epochs probably overfits, but this matters less here because the model is judged on the plausibility of its output rather than on test accuracy, and experimentation showed that longer training gave more plausible results in the style of Agatha Christie.**

In [None]:
model.fit(X, y, batch_size=64, epochs=200)

Epoch 1/200
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f2b1236c790>

In [None]:
# saving model weights to allow loading for predictions at later date
model.save_weights('./Poirot.h5')

# Testing

**Load the trained weights into the model here if you are not using it straight after training.**

In [None]:
#loading model weights from previous training run
model.load_weights('./Poirot.h5')
print('--------------------')
print('model weights loaded')
print('--------------------')

--------------------
model weights loaded
--------------------


**The final model can be examined below**

In [None]:
print(model.summary())

**The previously used encoder is loaded below to ensure consistency**

In [None]:
#loading the encoder used previously
with open('encoder.pkl', 'rb') as f:
    encoder = load(f)
print('--------------------')
print('encoder loaded')
print('--------------------')

--------------------
encoder loaded
--------------------


**Two functions are written: `predict_next()` and `generate_text()`**

**`predict_next()` prompts for a sequence of words of any length and predicts the next word. Shorter inputs are padded to 35 tokens and longer inputs are cut to their last 35, so 35 or more words work best.**

**`generate_text()` prompts for a seed in the same way, plus the number of words to generate, and outputs a generated sequence of that length.**
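
**The padding and truncation applied to the seed can be pictured with the small stand-in below. It mimics what `pad_sequences(..., maxlen=..., truncating='pre')` does to a single sequence with the default left-padding: short seeds get zeros on the left, long seeds keep only their last `maxlen` tokens. `pad_or_truncate_pre` is an illustrative helper, not a Keras function.**

```python
def pad_or_truncate_pre(encoded, maxlen):
    """Left-pad with zeros, or keep only the last `maxlen` items."""
    if len(encoded) >= maxlen:
        return encoded[-maxlen:]
    return [0] * (maxlen - len(encoded)) + encoded


short_seed = [5, 6]
long_seed = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate_pre(short_seed, maxlen=4))
print(pad_or_truncate_pre(long_seed, maxlen=4))
```

**Because index 0 never occurs in the training sequences, a heavily padded seed looks unlike anything the model was trained on, which is why longer seeds work better.**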



In [None]:
import builtins
def predict_next():
  #taking input as seed
  seed=str(builtins.input('please enter some words (preferably 35 or more) as input to the model: '))
  #cleaning input seed with the same filters used for the training text (including curly apostrophes)
  seed=text_to_word_sequence(seed, filters='“”’‘•!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

  #displaying seed
  print('--------------------------------------------------------------')
  print(f'input_seed = {seed}')
  print('--------------------------------------------------------------')

  #encoding seed with encoder used for training
  seed_encoded = encoder.texts_to_sequences([seed])[0]
  #padding or truncating the seed to `sequence_len` tokens (longer seeds keep their last `sequence_len` words)
  seed_encoded = pad_sequences([seed_encoded], maxlen=sequence_len, truncating='pre')
  #predicting next word using model (index of the most probable word)
  next_index = int(np.argmax(model.predict(seed_encoded), axis=-1)[0])
  #mapping the predicted index back to its word (empty string if the index is not in the vocabulary)
  next_word = encoder.index_word.get(next_index, '')
  print(f'Generated next word: *{next_word}*')
  print('--------------------------------------------------------------')
  print(f'New text: {" ".join(seed)} *{next_word}*')

def generate_text():
  #defining empty list for text to be stored in iteratively
  text = list()
  #taking input as seed
  input_seed = str(builtins.input('please enter some words (preferably 35 or more) as input to the model: '))
  #use input of number of words for generated sentence (validated)
  while True: 
      try:
          text_len = int(builtins.input("Please enter a number for the length of text to be generated: "))
          break
      except ValueError:
          print("Please enter a number...")
  #cleaning input seed with the same filters used for the training text (including curly apostrophes)
  seed=text_to_word_sequence(input_seed, filters='“”’‘•!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

  #displaying seed input
  print('--------------------------------------------------------------')
  print(f'input seed = {input_seed}')
  print('--------------------------------------------------------------')

  # iteratively predicting and appending new words to seed to generate text
  for i in range(text_len):
    #encoding seed using the encoder used previously
    seed_encoded = encoder.texts_to_sequences([seed])[0]
    seed_encoded = pad_sequences([seed_encoded], maxlen=sequence_len, truncating='pre')
    #index of the most probable next word, mapped back to the word itself
    next_index = int(np.argmax(model.predict(seed_encoded), axis=-1)[0])
    next_word = encoder.index_word.get(next_index, '')
    #`seed` is a list of words, so the new word is appended as one token
    #(using `+=` with a string would add it character by character)
    seed.append(next_word)
    text.append(next_word)
  print(f'Generated next word: *{text}*')
  print('--------------------------------------------------------------')
  print(f"""New text: {input_seed + " " + " ".join(text)}""")

**Testing out the `predict_next()` function (run code and you will be prompted)**

In [None]:
predict_next()

please enter some words (preferably 35 or more) as input to the model: the hand poirot led the way out of the flat and down the stairs our captive followed and i brought up the rear with the revolver when we were out in the street poirot turned to
--------------------------------------------------------------
input_seed = ['the', 'hand', 'poirot', 'led', 'the', 'way', 'out', 'of', 'the', 'flat', 'and', 'down', 'the', 'stairs', 'our', 'captive', 'followed', 'and', 'i', 'brought', 'up', 'the', 'rear', 'with', 'the', 'revolver', 'when', 'we', 'were', 'out', 'in', 'the', 'street', 'poirot', 'turned', 'to']
--------------------------------------------------------------
Generated next word: *me*
--------------------------------------------------------------
New text: the hand poirot led the way out of the flat and down the stairs our captive followed and i brought up the rear with the revolver when we were out in the street poirot turned to *me*


**Testing out the `generate_text()` function (run code and you will be prompted)**

In [None]:
generate_text()

please enter some words (preferably 35 or more) as input to the model: and my friend hastings looks at me with eyes of reproach but it was all so simple it was clear from the first that out of several hundred probably applicants for no 4 montagu mansions only
Please enter a number for the length of text to be generated: 35
--------------------------------------------------------------
input seed = and my friend hastings looks at me with eyes of reproach but it was all so simple it was clear from the first that out of several hundred probably applicants for no 4 montagu mansions only
--------------------------------------------------------------
Generated next word: *['the', 'in', 'the', 'poirot', 'one', 'you', 'after', 'as', 'minister', 'what', 'd', 'of', 'he', 'that', 'white', 'ourselves', 'o', 'with', 'the', 'on', 'imaginary', 'of', 'he', 'i', 'don', 'the', 'my', 'thief', 'the', 'poirot', 'yardly', 'a', 'little', 'enough', 'her']*
-----------------------------------------------------

In [None]:
sequences[16000]

'and my friend hastings looks at me with eyes of reproach but it was all so simple it was clear from the first that out of several hundred probably applicants for no 4 montagu mansions only'
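
**Both functions above pick the single most probable word at each step (greedy `argmax`). A common alternative, which this notebook does not use, is to sample the next word from the predicted distribution after rescaling it with a "temperature". The sketch below shows the idea on a made-up three-word distribution using only the standard library; `sample_with_temperature` and `toy_probs` are illustrative names.**

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Rescale a probability distribution by temperature, then sample one index.

    temperature < 1 sharpens the distribution (closer to argmax);
    temperature > 1 flattens it (more varied output).
    """
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtracted for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    scaled = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=scaled, k=1)[0]
    return index, scaled


rng = random.Random(42)
toy_probs = [0.5, 0.3, 0.2]  # made-up softmax output
index, scaled = sample_with_temperature(toy_probs, temperature=0.5, rng=rng)
print(index, scaled)
```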

Christie, A., 2021. Poirot Investigates by Agatha Christie. [online] Project Gutenberg. Available at: <http://www.gutenberg.org/ebooks/61262> [Accessed 16 February 2021].

Team, K., 2021. Keras documentation: Using pre-trained word embeddings. [online] Keras.io. Available at: <https://keras.io/examples/nlp/pretrained_word_embeddings/> [Accessed 21 February 2021].

Brownlee, J., 2021. How to Develop a Word-Level Neural Language Model and Use it to Generate Text. [online] Machine Learning Mastery. Available at: <https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/> [Accessed 17 February 2021].

Medium. 2021. Next Word Prediction: A Complete Guide. [online] Available at: <https://medium.com/linagoralabs/next-word-prediction-a-complete-guide-d2e69a7a09e6> [Accessed 24 February 2021].