<a href="https://colab.research.google.com/github/jhmuller/nextWord/blob/main/nextWord.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predict the next word

In this notebook I explore whether I can build a model to predict the next word from some number of previous words.

I will use Keras/TensorFlow to build the model and I will use 
the LAMBADA data set.  For background on LAMBADA see  
https://arxiv.org/abs/1606.06031

I modeled this after another attempt I found here  

https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf

In [None]:
import requests
import re
import os
import string
import numpy as np
import sklearn

## Fetch the LAMBADA data set
which I have stored on my Google Drive.
I keep only the first 700,000 characters of the file.

In [None]:
from google.colab import drive
try:
  os.listdir("/content/drive")
except FileNotFoundError:
  # Mount Google Drive only if it is not already mounted.
  drive.mount('/content/drive')
ddir = "/content/drive/MyDrive/data"
fname = 'lambada_development_plain_text.txt'
fpath = os.path.join(ddir, fname)
with open(fpath, 'r') as fp:
  orig_text = fp.read()
len(orig_text)
orig_text = orig_text[:700000]

## Clean up
Remove some punctuation, split the text into sentences at periods,
and delete the intermediate versions that are no longer needed.

In [None]:

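# Remove apostrophes, commas, question marks and exclamation marks.
clean_text = re.sub(r"[',?!]", '', orig_text)
# Turn each period into a '|' marker, then split into sentences on it.
cleaner_text = re.sub(r"\.", '|', clean_text)
sentences = cleaner_text.split('|')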
print(len(sentences))
print(sentences[:5])
del orig_text
del clean_text
del cleaner_text


11435
['her pay for the evening was almost double that of the wait staff and although that might not seem like a lot to some people  it was a small fortune to claire ', ' after loading her final tray for a server  claire went to the restroom to freshen up and begin preparations for being loaded into the cake ', ' pam had a couple of young men from college who assisted her into the cake ', ' brian and max were a lot of fun and always made her laugh as they hoisted her up to the top of the cake\n`` nineteen   she said  and he loosed a breath that could have been sadness or relief or maybe both  and told her that made her magic even more impressive ', ' she debated saying that he would be less impressed once he learned of her nickname for him  but winked at him instead ']
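To make the clean-up step concrete, here is a minimal sketch on a made-up string. The `sample` text is hypothetical and not taken from LAMBADA; the sketch removes the same four punctuation marks and then splits on periods.

```python
import re

# Hypothetical sample text, not taken from the data set.
sample = "don't stop, ok? she said. he nodded!"

# Remove apostrophes, commas, question marks and exclamation marks.
no_punct = re.sub(r"[',?!]", "", sample)

# Split into sentences at each period.
parts = no_punct.split(".")
print(parts)
```

As in the notebook output, sentences after the first keep their leading space. That should be harmless here, since the Keras tokenizer splits on whitespace.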


## Train and Test
Split the sentences into train and test sets.
I just use the first 70% of the sentences for train
and the rest for test.

In [None]:
N = len(sentences)
cutoff = int(np.floor(0.7*N))
train = sentences[:cutoff]
test = sentences[cutoff:]

print(len(train))
print(len(test))

8004
3431
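As a quick check of the split arithmetic, the sketch below recomputes the two sizes from the sentence count printed earlier (11435). The variable names are mine and only illustrative.

```python
import math

# Sentence count taken from the notebook output above.
n_sentences = 11435

# Same rule as in the notebook: the first 70% (rounded down) is train.
cutoff = int(math.floor(0.7 * n_sentences))
n_train = cutoff
n_test = n_sentences - cutoff
print(n_train, n_test)
```

Because the split is a simple cut by position, the test sentences all come from the later part of the file, not from a random sample.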


In [None]:
print(train[10:])
print(test[-5:])

['j', ' shot his brother a glare and squealed  `` i m goin  with ya ', '  `` oh  no you re not   corrected buster ', ' `` mom and dad are gon na be mad enough when they find out i went ', '  `` we could drop  im off at jamie s house   suggested pete\nmab looked thankfully up at the heavens for a moment  laughing in delight  then back at frotwoot with a big silly grin on he']


## Prepare Model Input
Here I manually create the X and y inputs for the model.
First I tokenize the train set.
I create data only from within sentences, that is, I do not let
an input window run across a sentence boundary.  I thought this would
make more sense.
After creating the X and y token arrays I do a 
one-hot encoding of y, since the output will be 
a probability for each word in the train vocabulary.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train)
vocab_size = len(tokenizer.word_index) + 1
NGRAM_LEN = 4
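def make_Xy(data, tokenizer, ngram_len=1):
  # Convert each sentence to a list of integer word indices.
  sequence_data = tokenizer.texts_to_sequences(data)
  print(len(sequence_data))
  X = []
  y = []
  for sentence in sequence_data:
    # Skip sentences too short to hold ngram_len inputs plus one target.
    if len(sentence) < ngram_len+1:
      continue
    # Slide a window of ngram_len input words plus one target word.
    # Note: this range stops two windows short of the sentence end;
    # range(1, len(sentence)-ngram_len+1) would use every window.
    for i in range(1, len(sentence)-ngram_len-1):
      words = sentence[i-1:i+ngram_len]
      X.append(words[:-1])
      y.append(words[-1])

  X = np.array(X)
  y = np.array(y)

  # One-hot encode the targets to match the softmax output layer.
  from tensorflow.keras.utils import to_categorical
  vocab_size = len(tokenizer.word_index) + 1
  print(vocab_size)
  ycat = to_categorical(y, num_classes=vocab_size)
  print(ycat[0].shape)
  return X, ycat, y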

Xtrain, ytrain, yindstrain = make_Xy(train, tokenizer, NGRAM_LEN)
Xtest, ytest, yindstest = make_Xy(test, tokenizer, NGRAM_LEN)

8004
11005
(11005,)
3431
11005
(11005,)
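The windowing and one-hot steps can be sketched on a toy sentence in plain Python. The helpers `make_windows` and `one_hot` and the token ids are hypothetical, for illustration only. Note that this helper takes every window in the sentence, so its loop bounds may differ slightly from those in `make_Xy`.

```python
def make_windows(tokens, ngram_len):
    """Return (inputs, targets) for every window in one sentence."""
    inputs, targets = [], []
    # Each window holds ngram_len input tokens followed by one target token.
    for start in range(len(tokens) - ngram_len):
        inputs.append(tokens[start:start + ngram_len])
        targets.append(tokens[start + ngram_len])
    return inputs, targets


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position index."""
    row = [0] * num_classes
    row[index] = 1
    return row


# Hypothetical token ids for one seven-word sentence.
sentence = [5, 12, 7, 3, 9, 4, 8]
inputs, targets = make_windows(sentence, ngram_len=4)
for x, t in zip(inputs, targets):
    print(x, "->", t)

# One row per target, with as many columns as there are classes.
encoded = [one_hot(t, num_classes=13) for t in targets]
```

With a vocabulary of 11005 words, each one-hot row has 11005 entries, which is why `ycat[0].shape` prints `(11005,)` above.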


In [None]:
print(tokenizer.sequences_to_texts([yindstest[-15:]]))
#tokenizer.sequences_to_texts(Xtest[-15:])

['the heavens for a moment laughing in then back at with a big silly grin']


## Define the model.
I mostly kept the same model as the article,
with a few exceptions.  I increased the dimension
of the word embedding and I allowed for any number of 
words in the input sequence.  I tried 3 and 4.

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(vocab_size, 20, input_length=NGRAM_LEN))
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 4, 20)             220100    
                                                                 
 lstm_2 (LSTM)               (None, 4, 1000)           4084000   
                                                                 
 lstm_3 (LSTM)               (None, 1000)              8004000   
                                                                 
 dense_1 (Dense)             (None, 1000)              1001000   
                                                                 
 dense_2 (Dense)             (None, 11005)             11016005  
                                                                 
Total params: 24,325,105
Trainable params: 24,325,105
Non-trainable params: 0
_________________________________________________________________
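The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers (with biases); the helper names are mine.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


vocab_size, embed_dim, units = 11005, 20, 1000
counts = [
    embedding_params(vocab_size, embed_dim),  # embedding
    lstm_params(embed_dim, units),            # first LSTM
    lstm_params(units, units),                # second LSTM
    dense_params(units, units),               # hidden dense layer
    dense_params(units, vocab_size),          # softmax output layer
]
print(counts, sum(counts))
```

Almost half of the parameters sit in the final softmax layer, because it needs one weight vector per vocabulary word.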


## Callbacks
The only one I used was ReduceLROnPlateau,
but I don't think it ever actually reduced 
the learning rate in any of my runs.

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.losses import SparseCategoricalCrossentropy

reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3, min_lr=0.0001, verbose = 1)
#tensorboard_Visualization = TensorBoard(log_dir=logdir)
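To see why the callback may never fire, here is a simplified plain-Python simulation of the plateau rule. It is a sketch and not the Keras implementation: it ignores options such as `min_delta` and `cooldown`, and both loss sequences are made up.

```python
def simulate_reduce_lr(losses, lr, factor=0.2, patience=3, min_lr=0.0001):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best:
            # The monitored loss improved, so reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Shrink the learning rate, but never below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical loss curves, for illustration only.
improving = simulate_reduce_lr([7.0, 6.5, 6.1, 5.8, 5.6, 5.5], lr=0.001)
stalled = simulate_reduce_lr([7.0, 6.5, 6.6, 6.7, 6.6, 6.8, 6.9, 7.0], lr=0.001)
print(improving)
print(stalled)
```

Under this rule a training loss that keeps improving, even slightly, never triggers a reduction, which would be consistent with what I saw.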

## Compile and Fit
I used pretty standard values for the hyperparameters.
Note the categorical cross entropy loss, which matches the one-hot encoded y.


In [None]:
from tensorflow.keras.optimizers import Adam
model.compile(loss='categorical_crossentropy', 
              metrics=['accuracy'], 
              optimizer=Adam(learning_rate=0.001))
model.fit(Xtrain, ytrain, epochs=16, batch_size=64, callbacks=[reduce_lr])

Epoch 1/16
 299/2243 [==>...........................] - ETA: 21:49 - loss: 7.0718 - accuracy: 0.0424

KeyboardInterrupt: ignored
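For reference, the sketch below computes categorical cross entropy by hand for one made-up example, and the loss a model would get by assigning equal probability to every word in the vocabulary. The probabilities are hypothetical.

```python
import math


def categorical_crossentropy(one_hot, probs):
    # -sum(t * log(p)); only the true class contributes for a one-hot target.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


# Hypothetical 4-word vocabulary; the true word is index 2.
target = [0, 0, 1, 0]
probs = [0.1, 0.2, 0.6, 0.1]
loss = categorical_crossentropy(target, probs)
print(loss)

# Loss of a uniform guess over the 11005-word vocabulary.
vocab_size = 11005
uniform_loss = math.log(vocab_size)
print(uniform_loss)
```

The uniform baseline is a bit above 9.3, so the loss of about 7.07 logged early in the first epoch is already below it.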

## Evaluate
How well does it do on the unseen data?

In [None]:
model.evaluate(Xtest, ytest)

## Look at some predictions
It may be interesting to look at the predicted words
alongside the true words.

In [None]:
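x_data = Xtest
y_data = ytest

# Predict a probability distribution over the vocabulary for 20 inputs.
vpreds = model.predict(x_data[:20])
print(len(vpreds))
# The predicted word is the index with the highest probability.
pred_inds = [np.argmax(x) for x in vpreds]
pred_words = tokenizer.sequences_to_texts([pred_inds])
print(f"pred_words: {pred_words}")

# Recover the true word indices from the one-hot targets.
true_inds = [np.argmax(x) for x in y_data[:20]]
true_words = tokenizer.sequences_to_texts([true_inds])
print(f"true_words: {true_words}")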

20


['stairs the felt tall a great of about e the be d he went not realize anything to on my']

['bed i was having a bad time i heard them talking but i did not listen brett came in and']
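Beyond eyeballing the words, the same argmax comparison gives an accuracy figure. The sketch below does this in plain Python on made-up probabilities over a 4-word vocabulary; none of the numbers come from the model.

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical predicted probabilities and one-hot targets.
pred_probs = [
    [0.1, 0.7, 0.1, 0.1],
    [0.3, 0.2, 0.4, 0.1],
    [0.25, 0.25, 0.3, 0.2],
]
true_one_hot = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]

pred_ids = [argmax(row) for row in pred_probs]
true_ids = [argmax(row) for row in true_one_hot]

# Fraction of examples where the top prediction is the true word.
n_correct = sum(p == t for p, t in zip(pred_ids, true_ids))
accuracy = n_correct / len(true_ids)
print(pred_ids, true_ids, accuracy)
```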