<a href="https://colab.research.google.com/github/jhmuller/nextWord/blob/main/nextWord.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What
Predict the next word given a sequence of words.

# Why
If the predictions are good and users can easily select
the right one then this is a clear win for users of
texting, email, word processing and more.

# Background
While texting the other day and seeing the word suggestions,
I started to wonder how easy it is to predict those words.

I was hoping to find a corpus of text messages to train on
but could not find one.  
What I did find was the LAMBADA data set,
which was built for word prediction tasks.
For more information about the data set see  
https://arxiv.org/abs/1606.06031


I modeled this after an article I found attempting to do much the same. 

https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf

In [1]:
import requests
import re
import os
import string
import numpy as np
import sklearn

## Fetch the LAMBADA data set
I have stored it on my Google Drive.

In [2]:
from google.colab import drive

# Mount Google Drive only if it is not already mounted
try:
  os.listdir("/content/drive")
except OSError:
  drive.mount('/content/drive')
ddir = "/content/drive/MyDrive/data"
fname = 'lambada_development_plain_text.txt'
fpath = os.path.join(ddir, fname)
with open(fpath, 'r') as fp:
  orig_text = fp.read()
# Keep only the first 700,000 characters
orig_text = orig_text[:700000]

Mounted at /content/drive


## Clean up
Get rid of some punctuation
and delete the variables no longer needed since the data is large.

Also, split the text into sentences,
so the sentences variable is a list of sentences.

In [3]:

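# Remove apostrophes, commas, question marks and exclamation marks
clean_text = re.sub(r"[',?!]", '', orig_text)
# Turn periods into '|' so the text can be split into sentences
cleaner_text = re.sub(r"\.", '|', clean_text)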
sentences = cleaner_text.split('|')
print(len(sentences))
print(sentences[:5])
del orig_text
del clean_text
del cleaner_text


8937
['her pay for the evening was almost double that of the wait staff and although that might not seem like a lot to some people  it was a small fortune to claire ', ' after loading her final tray for a server  claire went to the restroom to freshen up and begin preparations for being loaded into the cake ', ' pam had a couple of young men from college who assisted her into the cake ', ' brian and max were a lot of fun and always made her laugh as they hoisted her up to the top of the cake\n`` nineteen   she said  and he loosed a breath that could have been sadness or relief or maybe both  and told her that made her magic even more impressive ', ' she debated saying that he would be less impressed once he learned of her nickname for him  but winked at him instead ']


## Train and Test
Split the sentences into train and test sets.
I just use the first 70% for train
and the rest for test.

In [4]:
N = len(sentences)
cutoff = int(np.floor(0.7*N))
train = sentences[:cutoff]
test = sentences[cutoff:]

print(len(train))
print(len(test))

6255
2682


In [5]:
print(train[10:])
print(test[-5:])

[' aspen ran her hands down the front of his jacket  smoothing out invisible wrinkles ', ' she always wanted him to look perfect ', ' `` of course you will be  jackson ', ' what other option is there   he nodded his head  conceding to her words ', ' `` your fait']


## Prepare Model Input
Here I manually create the X and y inputs for the model. This was probably the most involved part of the project. 

Note from above that I split the data into sentences, that is,
the data is a sequence of sentences.
I did this because I felt that I might get better results
if I did not let the word windows run across sentence boundaries.
This was part of the complication mentioned above.

After creating the X and y token arrays I one-hot encode y
using to_categorical, since the model outputs
a probability for each word in the training vocabulary.

In [6]:
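from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# Build the vocabulary from the training sentences only
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train)
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved
NGRAM_LEN = 4

def make_Xy(data, tokenizer, ngram_len=1):
  # Convert each sentence to a list of integer word indices
  sequence_data = tokenizer.texts_to_sequences(data)
  print(len(sequence_data))
  X = []
  y = []
  for sentence in sequence_data:
    # Skip sentences too short to hold ngram_len inputs plus one target
    if len(sentence) < ngram_len+1:
      continue
    # Slide a window of ngram_len+1 tokens along the sentence.
    # Note: this range stops before the final two windows of each sentence.
    for i in range(1, len(sentence)-ngram_len-1):
      words = sentence[i-1:i+ngram_len]
      X.append(words[:-1])  # first ngram_len tokens are the input
      y.append(words[-1])   # last token is the word to predict

  X = np.array(X)
  y = np.array(y)

  # One-hot encode the targets over the training vocabulary
  vocab_size = len(tokenizer.word_index) + 1
  print(vocab_size)
  ycat = to_categorical(y, num_classes=vocab_size)
  print(ycat[0].shape)
  return X, ycat, y

Xtrain, ytrain, yindstrain = make_Xy(train, tokenizer, NGRAM_LEN)
Xtest, ytest, yindstest = make_Xy(test, tokenizer, NGRAM_LEN)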

6255
9616
(9616,)
2682
9616
(9616,)
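
To make the windowing concrete, here is a minimal pure-Python sketch on one sentence from the training data. The helper name `make_windows` is hypothetical and not part of the notebook. It works on raw words rather than token indices, and unlike the loop in make_Xy it slides all the way to the end of the sentence.

```python
def make_windows(tokens, ngram_len):
    """Return (contexts, targets) built from one tokenized sentence."""
    X, y = [], []
    # Each window is ngram_len context tokens followed by one target token
    for start in range(len(tokens) - ngram_len):
        X.append(tokens[start:start + ngram_len])
        y.append(tokens[start + ngram_len])
    return X, y

sentence = "she always wanted him to look perfect".split()
X, y = make_windows(sentence, ngram_len=4)
for context, target in zip(X, y):
    print(context, "->", target)
```

A seven-word sentence with a context of 4 gives three training pairs, so short sentences contribute very little data.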


In [7]:
print(tokenizer.sequences_to_texts([yindstest[-15:]]))
#tokenizer.sequences_to_texts(Xtest[-15:])

['father shook down the front of his jacket to there he nodded his head to']


## Define the model
I mostly kept the same model as the article
with a few exceptions. I increased the dimension
of the word embedding.
I also allowed for any number of
words in the predicting sequence, which I call NGRAM_LEN. The article just used 1.
I tried 3 and 4.

In [8]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(vocab_size, 20, input_length=NGRAM_LEN))
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))

## Model Summary
Print out summary of the model.
Make sure that the initial input and final output
make sense given the data.

In [9]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 4, 20)             192320    
                                                                 
 lstm (LSTM)                 (None, 4, 1000)           4084000   
                                                                 
 lstm_1 (LSTM)               (None, 1000)              8004000   
                                                                 
 dense (Dense)               (None, 1000)              1001000   
                                                                 
 dense_1 (Dense)             (None, 9616)              9625616   
                                                                 
Total params: 22,906,936
Trainable params: 22,906,936
Non-trainable params: 0
_________________________________________________________________
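
One way to check that the summary makes sense is to recompute the parameter counts by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers. The helper names are my own, and the sizes are read off the summary above.

```python
def embedding_params(vocab_size, dim):
    # One vector of length dim per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embed_dim, hidden = 9616, 20, 1000
counts = [
    embedding_params(vocab_size, embed_dim),  # embedding
    lstm_params(embed_dim, hidden),           # lstm
    lstm_params(hidden, hidden),              # lstm_1
    dense_params(hidden, hidden),             # dense
    dense_params(hidden, vocab_size),         # dense_1
]
total = sum(counts)
print(counts, total)
```

The final dense layer alone accounts for over 9.6 million of the parameters, because it has one output per vocabulary word.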


## Callbacks
The only callback I used was ReduceLROnPlateau,
but I don't think it ever actually reduced
the learning rate in any of my runs.

In [10]:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.losses import SparseCategoricalCrossentropy

reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3, min_lr=0.0001, verbose = 1)
#tensorboard_Visualization = TensorBoard(log_dir=logdir)
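
As a rough illustration of why the callback may never fire, here is a simplified pure-Python simulation of the plateau rule with the same factor, patience and min_lr as above. It is a sketch, not the Keras implementation: the function name is hypothetical, the loss values are made up, and the min_delta default of 1e-4 is an assumption about the threshold for counting an improvement.

```python
def simulate_reduce_lr(losses, lr=0.001, factor=0.2, patience=3,
                       min_lr=0.0001, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best - min_delta:
            # Loss improved: remember it and reset the patience counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

plateau = simulate_reduce_lr([2.0, 1.5, 1.5, 1.5, 1.5, 1.4])  # made-up losses
steady = simulate_reduce_lr([2.0, 1.8, 1.6, 1.4, 1.2, 1.0])   # made-up losses
print(plateau)
print(steady)
```

With a training loss that improves every epoch, as in the second run, the rate never changes. That is one plausible explanation for what I saw, since the callback monitors the training loss.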

## Compile and Fit
The loss is categorical cross-entropy and I used the Adam optimizer.
I used pretty standard values for hyperparameters such as  
learning rate and batch size.





In [11]:
from tensorflow.keras.optimizers import Adam
model.compile(loss='categorical_crossentropy', 
              metrics=['accuracy'], 
              optimizer=Adam(learning_rate=0.001))
model.fit(Xtrain, ytrain, epochs=25, batch_size=64, callbacks=[reduce_lr])

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7fa0b627be80>

## Evaluate
The model seems to do well on the training set, with an accuracy of 93%.

But what is important is how well it does on unseen data.

So next I evaluate on the test set.

In [12]:
model.evaluate(Xtest, ytest)



[22.17169189453125, 0.07291167974472046]
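
To put the test loss in context, it helps to compare it with the loss of a model that spreads its probability evenly over the whole vocabulary. For categorical cross-entropy with one-hot targets that baseline is the natural log of the vocabulary size. The sketch below only uses the two numbers already reported in this notebook.

```python
import math

vocab_size = 9616               # from the model summary
test_loss = 22.17169189453125   # first value returned by model.evaluate

# Cross-entropy of a uniform guess over the vocabulary
uniform_loss = math.log(vocab_size)
# Geometric-mean probability the model gave to the true word
implied_prob = math.exp(-test_loss)

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"probability implied by the test loss: {implied_prob:.1e}")
```

The uniform baseline is about 9.17, so a test loss above 22 means the model is confidently wrong on many test examples, not merely uncertain.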

## Out of Sample Performance
Over 90% accuracy on the training set but less than 10% accuracy on the test set. I would say that is pretty disappointing.
Note that I also tried using the text from the novel The Sun Also Rises by Hemingway, which I got from Project Gutenberg. I saw similar performance on that.

I don't think more epochs would help at all here.
Perhaps more LSTM layers or some different architecture would be better.

Clearly the problem can be solved in a limited domain: the word suggestions when I am texting are pretty good. Note that I usually get 3 suggestions, whereas above we are evaluating only the single most likely word.
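
A fairer comparison with the three suggestions on a phone would be top-3 accuracy, where a prediction counts as correct if the true word is among the three most probable. Here is a minimal pure-Python sketch on made-up probabilities. The function name `top_k_accuracy` is my own, and I have not run this on the model above.

```python
def top_k_accuracy(prob_rows, true_indices, k=3):
    """Fraction of rows whose true index is among the k most probable."""
    hits = 0
    for probs, true_idx in zip(prob_rows, true_indices):
        # Indices sorted by probability, highest first, cut to the top k
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        hits += true_idx in ranked[:k]
    return hits / len(true_indices)

# Made-up predictions over a 4-word vocabulary
probs = [
    [0.05, 0.60, 0.25, 0.10],
    [0.70, 0.10, 0.15, 0.05],
    [0.40, 0.30, 0.25, 0.05],
]
truth = [2, 0, 3]

top1 = top_k_accuracy(probs, truth, k=1)
top3 = top_k_accuracy(probs, truth, k=3)
print(top1, top3)
```

In the notebook the inputs would be `model.predict(Xtest)` and `yindstest`. Keras also provides a `TopKCategoricalAccuracy` metric that could be passed to `model.compile` instead.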

Maybe I will try to find a text message data set and try again.

## Look at some predictions
It may be interesting to compare the predicted words
with the true words.

In [13]:
x_data = Xtest
y_data = ytest

vpreds = model.predict(x_data[:20])
pred_inds = [np.argmax(x) for x in vpreds]
pred_words = tokenizer.sequences_to_texts([pred_inds])
print(f"predicted_words: {pred_words}")


true_inds = [np.argmax(x) for x in y_data[:20]]
true_words = tokenizer.sequences_to_texts([true_inds])
print(f"true_words: {true_words}")

predicted_words: ['person the m been to have in especially save seat to go do normally free whole of the her in']
true_words: ['but i d need to work hard to get ready and i d be the out of breath hiding in']


It got a few right, but still very disappointing.