<a href="https://colab.research.google.com/github/pbevan1/Word-Level-Language-Model-Poirot/blob/main/poirot_recurrent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**The raw text version of 'Poirot Investigates' by Agatha Christie is downloaded from Project Guttenberg**

In [115]:
!pip install keras-tuner



In [168]:
import numpy as np
import os
import string
from keras.preprocessing.text import text_to_word_sequence, Tokenizer
from keras.utils import to_categorical
from keras.models import Model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Input
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing.sequence import pad_sequences
import itertools
import re
from pickle import dump

from kerastuner.tuners import RandomSearch
from kerastuner.tuners import Hyperband
from kerastuner.engine.hyperparameters import HyperParameters

In [61]:
!curl -O https://www.gutenberg.org/files/61262/61262-0.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100  326k  100  326k    0     0  1527k      0 --:--:-- --:--:-- --:--:-- 1527k


**Unwanted text (publishing information etc) is deleted from the text and a new file saved as 'Poirot.txt'**

In [62]:
#deleting unwanted text and front and back of book
with open('61262-0.txt') as old, open('Poirot.txt', 'w') as new:
    lines = old.readlines()
    new.writelines(lines[110:-374])

In [169]:
#opening the text file, saving into memory as `Poirot` and closing the file
text = open('Poirot.txt', 'r')
Poirot = text.read()
text.close()

#printing first 1000 characters to double check
print(Poirot[:1000])

  The Adventure of “The Western Star”

I was standing at the window of Poirot’s rooms looking out idly on
the street below.

“That’s queer,” I ejaculated suddenly beneath my breath.

“What is, _mon ami_?” asked Poirot placidly, from the depths of
his comfortable chair.

“Deduce, Poirot, from the following facts! Here is a young lady,
richly dressed—fashionable hat, magnificent furs. She is coming
along slowly, looking up at the houses as she goes. Unknown to her,
she is being shadowed by three men and a middle-aged woman. They
have just been joined by an errand boy who points after the girl,
gesticulating as he does so. What drama is this being played? Is
the girl a crook, and are the shadowers detectives preparing to
arrest her? Or are _they_ the scoundrels, and are they plotting to
attack an innocent victim? What does the great detective say?”

“The great detective, _mon ami_, chooses, as ever, the simplest
course. He rises to see for himself.” And my friend joined me at
the window.


In [170]:
# using keras' `text_to_word_sequence` function to tokenise the text
# #additional characters added to filters to take out weird quotes
cleaned_tokens = text_to_word_sequence(Poirot, filters='“”•!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

print(f'There are {len(cleaned_tokens)} words in the cleaned version of Poirot Investigates ({len(set(cleaned_tokens))} unique words).')
print('------------------------------------------------------------------------------------------------------')
print(f'Sample of 10 tokens: {cleaned_tokens[:10]}')

There are 52809 words in the cleaned version of Poirot Investigates (6379 unique words).
------------------------------------------------------------------------------------------------------
Sample of 10 tokens: ['the', 'adventure', 'of', 'the', 'western', 'star', 'i', 'was', 'standing', 'at']


In [171]:
#make sequences of 36 tokens
length = 36
sequences = list()
for i in range(length, len(cleaned_tokens)):
	# select sequence of tokens
	seq = cleaned_tokens[i-length:i]
	# convert into a line
	line = ' '.join(seq)
	# store
	sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 52773


In [172]:
# defining tokenizer class to encode word sequences
encoder = Tokenizer()
encoder.fit_on_texts(sequences)
encoded_lines = encoder.texts_to_sequences(sequences)
print('--first line of text shown below in encoded form--')
print(encoded_lines[0])
#saving encoder for use at time of generation
dump(encoder, open('./encoder.pkl', 'wb'))

--first line of text shown below in encoded form--
[1, 1590, 5, 1, 1113, 419, 4, 8, 656, 22, 1, 273, 5, 189, 821, 309, 50, 3086, 24, 1, 597, 1291, 216, 1290, 4, 1112, 256, 1111, 20, 728, 37, 13, 117, 140, 176, 15]


In [173]:
word_index = encoder.word_index
print('printing 10 values from the `word_index` dictionary')
print('---------------------------------------------------')
dict(itertools.islice(word_index.items(), 10))

printing 10 values from the `word_index` dictionary
---------------------------------------------------


{'a': 3,
 'and': 6,
 'he': 9,
 'i': 4,
 'in': 7,
 'of': 5,
 'the': 1,
 'to': 2,
 'was': 8,
 'you': 10}

In [174]:
# defining numer of words in vocabulary
word_dim = len(set(cleaned_tokens))+1
# making encoded lines into numpy array so that model can process
encoded_lines = np.array(encoded_lines)

# defining predictor and response variables full dataset for full training
X, y = encoded_lines[:,:-1], encoded_lines[:,-1]

#splitting to train and validate for tuning
X_train, y_train = encoded_lines[:45000,:-1], encoded_lines[:45000,-1]
X_val, y_val = encoded_lines[45000:,:-1], encoded_lines[45000:,-1]

# one hot encoding response variable to fit with softmax function dense layer
y = to_categorical(y, num_classes=word_dim)
y_train = to_categorical(y_train, num_classes=word_dim)
y_val = to_categorical(y_val, num_classes=word_dim)

# saving the sequence length for later use
sequence_len = X.shape[1]
sequence_len_train = X_train.shape[1]

**The model is defined**

In [175]:
X.shape

(52773, 35)

In [93]:
# define tuner model

def build_model(hp):
  #sequential less flexible but using as it allows `predict_classes` later on
  model = Sequential()

  #adding embedding layer to project words into higher dimensional space
  model.add(Embedding(word_dim, hp.Int('n_dimensions', 50, 200, 50), input_length=sequence_len))
  
  # tuning number of return sequence layers alongside number of units in each layer
  for i in range(hp.Int('n_layers', 1, 3)):
    model.add(LSTM(hp.Int('n_units', 64, 256, 64), return_sequences=True))

  model.add(LSTM(hp.Int('n_units', 64, 256, 64)))

  model.add(Dense(hp.Int('n_units', 64, 256, 64), activation='relu'))

  #softmax dense layer to output probability distribution for each word
  model.add(Dense(word_dim, activation='softmax'))

  print(model.summary())

  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

  return model

In [94]:
# setting up tuner object and how long to tune for
seed = 1 # seed to improve reproducability
executions_per_trial = 2
max_trials = 3
tuner = RandomSearch(build_model, max_trials = max_trials , objective = 'val_accuracy', seed=seed, executions_per_trial = executions_per_trial, directory = '/content/drive/MyDrive/LOG_DIR_Poirot')

INFO:tensorflow:Reloading Oracle from existing project /content/drive/MyDrive/LOG_DIR_Poirot/untitled_project/oracle.json
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 30, 50)            319000    
_________________________________________________________________
lstm (LSTM)                  (None, 30, 64)            29440     
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 6380)              414700    
Total params: 800,324
Trainable params: 800,324
Non-trainable params: 0
____________________________________________________________

In [72]:
tuner.search_space_summary()

Search space summary
Default search space size: 3
n_dimensions (Int)
{'default': None, 'conditions': [], 'min_value': 50, 'max_value': 200, 'step': 50, 'sampling': None}
n_layers (Int)
{'default': None, 'conditions': [], 'min_value': 1, 'max_value': 3, 'step': 1, 'sampling': None}
n_units (Int)
{'default': None, 'conditions': [], 'min_value': 64, 'max_value': 256, 'step': 64, 'sampling': None}


In [73]:
# executing search with validation data to evaluate
tuner.search(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_val,y_val))

Trial 3 Complete [00h 01m 49s]
val_accuracy: 0.09318123757839203

Best val_accuracy So Far: 0.09318123757839203
Total elapsed time: 00h 05m 39s
INFO:tensorflow:Oracle triggered exit


**The best model chosen by keras tuner is printed below. A smaller model with the least layers available in the search was chosen, and a mid value of 128 units per lstm layer. Since the data is very small, perhaps the deeper more complex models were quickly overfitting.**

In [74]:
#saving best model once tuning complete
best_model = tuner.get_best_models(num_models=1)[0]

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 6, 100)            638000    
_________________________________________________________________
lstm (LSTM)                  (None, 6, 128)            117248    
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 128)               16512     
_________________________________________________________________
dense_1 (Dense)              (None, 6380)              823020    
Total params: 1,726,364
Trainable params: 1,726,364
Non-trainable params: 0
_________________________________________________________________
None


**The below is code from Keras for transfer learning of gloVe word embeddings**

In [129]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip -q glove.6B.zip


path_to_glove_file = "./glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))


num_tokens = len(set(cleaned_tokens)) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

from keras.initializers import  Constant

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,
)

**The chosen model is defined below and trained on the full book**

In [184]:
input_layer = Input(shape=(sequence_len,))

#adding embedding layer to project words into higher dimensional space
embedded = embedding_layer(input_layer)

lstm1 = LSTM(128, return_sequences=True)(embedded)

lstm2 = LSTM(128, return_sequences=True)(lstm1)

lstm3 = LSTM(128)(lstm2)

dense = Dense(128, activation='relu')(lstm3)

#softmax dense layer to output probability distribution for each word
softmax = Dense(word_dim, activation='softmax')(dense)

model = Model(inputs=input_layer, outputs=softmax)

print(model.summary())

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Model: "model_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_9 (InputLayer)         [(None, 35)]              0         
_________________________________________________________________
embedding_2 (Embedding)      multiple                  638100    
_________________________________________________________________
lstm_20 (LSTM)               (None, 35, 128)           117248    
_________________________________________________________________
lstm_21 (LSTM)               (None, 35, 128)           131584    
_________________________________________________________________
lstm_22 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_20 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_21 (Dense)             (None, 6380)              8230

**The model is now trained with the full data for 50 epochs. 50 Epochs would likely overfit when evaluated with the validation set used previously, but since the model is not being evaluated on test data and accuracy, rather plausability, this doesn't matter so much and experimentation showed longer training to give more plausible results in the style of agatha christie.**

In [194]:
model.fit(X, y, batch_size=64, epochs=500)

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/500
Epoch 48/500
Epoch 49/500
Epoch 50/500
Epoch 51/500
Epoch 52/500
Epoch 53/500
Epoch 54/500
Epoch 55/500
Epoch 56/500
Epoch 57/500
Epoch 58/500
Epoch 59/500
Epoch 60/500
Epoch 61/500
Epoch 62/500
Epoch 63/500
Epoch 64/500
Epoch 65/500
Epoch 66/500
Epoch 67/500
Epoch 68/500
Epoch 69/500
Epoch 70/500
Epoch 71/500
Epoch 72/500
Epoch 73/500
Epoch 74/500
Epoch 75/500
Epoch 76/500
Epoch 77/500
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fb466a52ac8>

In [140]:
# saving model weights to allow loading for predictions at later date
model.save_weights('/content/drive/MyDrive/Poirot.h5')

In [160]:
#loading model weights from previous training run
model = model.load_weights('/content/drive/MyDrive/Poirot.h5')
print('--------------------')
print('model weights loaded')
print('--------------------')

--------------------
model weights loaded
--------------------


In [195]:
print(model.summary())

Model: "model_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_9 (InputLayer)         [(None, 35)]              0         
_________________________________________________________________
embedding_2 (Embedding)      multiple                  638100    
_________________________________________________________________
lstm_20 (LSTM)               (None, 35, 128)           117248    
_________________________________________________________________
lstm_21 (LSTM)               (None, 35, 128)           131584    
_________________________________________________________________
lstm_22 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_20 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_21 (Dense)             (None, 6380)              8230

In [None]:
#loading the encoder used previously
encoder = load(open('encoder.pkl', 'rb'))
print('--------------------')
print('encoder loaded')
print('--------------------')

In [187]:
import builtins
def predict_next():
  #taking input as seed
  seed=str(builtins.input('please enter some words (preferably 6 or more) as input to the model: '))
  #cleaning input seed in same way as training text cleaned
  seed=text_to_word_sequence(seed, filters='“”•!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

  #displaying seed
  print('--------------------------------------------------------------')
  print(f'input_seed = {seed}')
  print('--------------------------------------------------------------')

  #encoding seed with encoder used for training
  seed_encoded = encoder.texts_to_sequences([seed])[0]
  #padding the seed so that inputs of varying sizes become 3 (longer takes last 3 words)
  seed_encoded = pad_sequences([seed_encoded], maxlen=sequence_len, truncating='pre')
  #prdicting next word using model
  next = np.argmax(model.predict(seed_encoded), axis=-1)
  next_word = ''
  for word, index in encoder.word_index.items():
    if index == next:
      next_word = word
      break
  print(f'Generated next word: *{next_word}*')
  print('--------------------------------------------------------------')
  print(f'New text: {" ".join(seed)} *{next_word}*')

def generate_text():
  #defining empty list for text to be stored in iteratively
  text = list()
  #taking input as seed
  input_seed = str(builtins.input('please enter some words (preferably 6 or more) as input to the model: '))
  #use input of number of words for generated sentence (validated)
  while True: 
      try:
          text_len = int(builtins.input("Please enter a number for the length of text to be generated: "))
          break
      except ValueError:
          print("Please enter a number...")
  #cleaning input seed in same way as training text cleaned
  seed=text_to_word_sequence(input_seed, filters='“”•!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

  #displaying seed input
  print('--------------------------------------------------------------')
  print(f'input seed = {input_seed}')
  print('--------------------------------------------------------------')

  # iteratively predicting and appending new words to seed to generate text
  for i in range(text_len):
    #encoding seed using generator used previously
    seed_encoded = encoder.texts_to_sequences([seed])[0]
    seed_encoded = pad_sequences([seed_encoded], maxlen=sequence_len, truncating='pre')
    next = np.argmax(model.predict(seed_encoded), axis=-1)
    next_word = ''
    for word, index in encoder.word_index.items():
      if index == next:
        next_word = word
        break
    seed += ' ' + ''.join(next_word)
    text.append(next_word)
  print(f'Generated next word: *{text}*')
  print('--------------------------------------------------------------')
  print(f"""New text: {input_seed + " " + " ".join(text)}""")

In [181]:
predict_next()

please enter some words (preferably 6 or more) as input to the model: As usual, Poirot was right. After a short interval, the American film star was ushered in, and we rose
--------------------------------------------------------------
input_seed = ['as', 'usual', 'poirot', 'was', 'right', 'after', 'a', 'short', 'interval', 'the', 'american', 'film', 'star', 'was', 'ushered', 'in', 'and', 'we', 'rose']
--------------------------------------------------------------
Generated next word: *him*
--------------------------------------------------------------
New text: as usual poirot was right after a short interval the american film star was ushered in and we rose *him*


In [206]:
generate_text()

please enter some words (preferably 6 or more) as input to the model: I looked with some curiosity at “Alfred darling”. He certainly struck a rather alien note. I did not wonder at John objecting to his beard. It was one of the
Please enter a number for the length of text to be generated: 20
--------------------------------------------------------------
input seed = I looked with some curiosity at “Alfred darling”. He certainly struck a rather alien note. I did not wonder at John objecting to his beard. It was one of the
--------------------------------------------------------------
Generated next word: *['greatest', 'left', 'with', 'see', 'these', 'i', 'pray', 'little', 'with', 'suggested', 'précisément', 'tosswill', 'pretty', 'and', 'that', 'shame', 'but', 'that', 'pair', 'gently']*
--------------------------------------------------------------
New text: I looked with some curiosity at “Alfred darling”. He certainly struck a rather alien note. I did not wonder at John objecting to hi

In [204]:
sequences[361]

'know it well you really have the best opinion of yourself of anyone i ever knew i cried divided between amusement and annoyance what will you when one is unique one knows it and others share'

https://keras.io/examples/nlp/pretrained_word_embeddings/