## Project Description: Next Word Prediction Using LSTM
#### Project Overview:

This project aims to develop a deep learning model for predicting the next word in a given sequence of words. The model is built using Long Short-Term Memory (LSTM) networks, which are well-suited for sequence prediction tasks. The project includes the following steps:

1- Data Collection: We use the text of Shakespeare's "Hamlet" as our dataset. This rich, complex text provides a good challenge for our model.

2- Data Preprocessing: The text data is tokenized, converted into sequences, and padded to ensure uniform input lengths. The sequences are then split into training and testing sets.

3- Model Building: An LSTM model is constructed with an embedding layer, two LSTM layers, and a dense output layer with a softmax activation function to predict the probability of the next word.

4- Model Training: The model is trained using the prepared sequences, with early stopping implemented to prevent overfitting. Early stopping monitors the validation loss and stops training when the loss stops improving.

5- Model Evaluation: The model is evaluated using a set of example sentences to test its ability to predict the next word accurately.

6- Deployment: A Streamlit web application is developed to allow users to input a sequence of words and get the predicted next word in real-time.

In [1]:
# Data Collection
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
import pandas as pd
import numpy as np

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\lalra\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [2]:
# load the dataset hamlet.txt
hamlet = gutenberg.raw('shakespeare-hamlet.txt')
#carriage returns are sometimes used in text files to indicate the end of a line, and they can cause issues when processing text data. By removing them, we ensure that the text is clean and easier to work with.

#save the dataset to a text file
with open('hamlet.txt', 'w', encoding='utf-8') as f:
    f.write(hamlet)

In [3]:
#Data Processing
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import string

#load the dataset
##laod the dataset
with open('hamlet.txt','r') as file:
    text=file.read().lower()

print(text)

[the tragedie of hamlet by william shakespeare 1599]


actus primus. scoena prima.

enter barnardo and francisco two centinels.

  barnardo. who's there?
  fran. nay answer me: stand & vnfold
your selfe

   bar. long liue the king

   fran. barnardo?
  bar. he

   fran. you come most carefully vpon your houre

   bar. 'tis now strook twelue, get thee to bed francisco

   fran. for this releefe much thankes: 'tis bitter cold,
and i am sicke at heart

   barn. haue you had quiet guard?
  fran. not a mouse stirring

   barn. well, goodnight. if you do meet horatio and
marcellus, the riuals of my watch, bid them make hast.
enter horatio and marcellus.

  fran. i thinke i heare them. stand: who's there?
  hor. friends to this ground

   mar. and leige-men to the dane

   fran. giue you good night

   mar. o farwel honest soldier, who hath relieu'd you?
  fra. barnardo ha's my place: giue you goodnight.

exit fran.

  mar. holla barnardo

   bar. say, what is horatio there?
  hor. a peece of

In [4]:
# The Tokenizer in Keras is not a vector representation of the vocabulary but rather a utility that creates a mapping
# (key-value pair) between words and integers.
tokenizer = Tokenizer()
# Fit the tokenizer on the text data
# The fit_on_texts method updates the internal vocabulary based on the list of texts. It creates a word index dictionary
tokenizer.fit_on_texts([text])
total_words=len(tokenizer.word_index)+1 # +1 for padding token
# The word_index attribute is a dictionary mapping words to their index in the vocabulary.
# The total number of unique words in the vocabulary is obtained by getting the length of the word_index dictionary and adding 1 for the padding token.
# Convert the text to sequences of integers
print("Total words in vocabulary:", total_words)
print("Tokenizer word index:", tokenizer.word_index)


Total words in vocabulary: 4818


In [5]:
# Create a input sequence and its corresponding output word
# We need to create input sequences and their corresponding output words for training the model. 
# This is done by creating n-grams of words. n-gram is a contiguous sequence of n items from a given sample of text or speech.
input_sequences = []
for line in text.split('\n'):  # Split the text into sentences using '.' as the delimiter

    # Tokenize the line into words
    # The texts_to_sequences method converts the text to a sequence of integers based on the word index created by the tokenizer.
    token_list = tokenizer.texts_to_sequences([line])[0]

    # Create n-grams of words
    # n-grams are contiguous sequences of n items from a given sample of text or speech.
    # Here, we are creating n-grams of size 1 to the length of the tokenized line.
    # The input sequences are created by taking the first i words as input and the (i+1)th word as the output.
    for i in range(1,len(token_list)):
        n_gram_sequence=token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In [6]:
# Display the first 5 input sequences
input_sequences

#[1, 687] : represents the first word in the sequence and the second word in the sequence.
# The input_sequences list contains sequences of integers representing the words in the text.

[[1, 687],
 [1, 687, 4],
 [1, 687, 4, 45],
 [1, 687, 4, 45, 41],
 [1, 687, 4, 45, 41, 1886],
 [1, 687, 4, 45, 41, 1886, 1887],
 [1, 687, 4, 45, 41, 1886, 1887, 1888],
 [1180, 1889],
 [1180, 1889, 1890],
 [1180, 1889, 1890, 1891],
 [57, 407],
 [57, 407, 2],
 [57, 407, 2, 1181],
 [57, 407, 2, 1181, 177],
 [57, 407, 2, 1181, 177, 1892],
 [407, 1182],
 [407, 1182, 63],
 [408, 162],
 [408, 162, 377],
 [408, 162, 377, 21],
 [408, 162, 377, 21, 247],
 [408, 162, 377, 21, 247, 882],
 [18, 66],
 [451, 224],
 [451, 224, 248],
 [451, 224, 248, 1],
 [451, 224, 248, 1, 30],
 [408, 407],
 [451, 25],
 [408, 6],
 [408, 6, 43],
 [408, 6, 43, 62],
 [408, 6, 43, 62, 1893],
 [408, 6, 43, 62, 1893, 96],
 [408, 6, 43, 62, 1893, 96, 18],
 [408, 6, 43, 62, 1893, 96, 18, 566],
 [451, 71],
 [451, 71, 51],
 [451, 71, 51, 1894],
 [451, 71, 51, 1894, 567],
 [451, 71, 51, 1894, 567, 378],
 [451, 71, 51, 1894, 567, 378, 80],
 [451, 71, 51, 1894, 567, 378, 80, 3],
 [451, 71, 51, 1894, 567, 378, 80, 3, 273],
 [451, 71

In [7]:
#Pad sequences
# Padding is the process of adding zeros to the sequences to make them of equal length.
# This is necessary because the input to the model should be of the same shape.
max_sequence_len=max([len(x) for x in input_sequences])
max_sequence_len

14

In [8]:
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
# The pad_sequences function is used to pad the sequences to the same length.
input_sequences

array([[   0,    0,    0, ...,    0,    1,  687],
       [   0,    0,    0, ...,    1,  687,    4],
       [   0,    0,    0, ...,  687,    4,   45],
       ...,
       [   0,    0,    0, ...,    4,   45, 1047],
       [   0,    0,    0, ...,   45, 1047,    4],
       [   0,    0,    0, ..., 1047,    4,  193]])

In [9]:
# create the predictors and label : this is done by splitting the input sequences into predictors and labels.
# input_sequences[:,:-1] : this selects all rows and all columns except the last one.
# input_sequences[:,-1] : this selects all rows and the last column.
# x is the input data(predictors) and y is the output data(label).
x, y = input_sequences[:,:-1],input_sequences[:,-1]

In [10]:
x

array([[   0,    0,    0, ...,    0,    0,    1],
       [   0,    0,    0, ...,    0,    1,  687],
       [   0,    0,    0, ...,    1,  687,    4],
       ...,
       [   0,    0,    0, ...,  687,    4,   45],
       [   0,    0,    0, ...,    4,   45, 1047],
       [   0,    0,    0, ...,   45, 1047,    4]])

In [11]:
y

array([ 687,    4,   45, ..., 1047,    4,  193])

In [12]:
import tensorflow as tf
# num_classes=total_words : this specifies the number of classes for the output layer.
# The output layer will have a number of neurons equal to the number of unique words in the vocabulary.

y = tf.keras.utils.to_categorical(y, num_classes=total_words)
# The to_categorical function is used to convert the labels to one-hot encoded format.
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [13]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Normalize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# The StandardScaler is used to standardize the features by removing the mean and scaling to unit variance.

In [14]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((20585, 13), (5147, 13), (20585, 4818), (5147, 4818))

In [15]:
# Define early stopping
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

In [16]:
# No train the LSTM RNN model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Define or create the model
model=Sequential()
model.add(Embedding(total_words,100,input_length=max_sequence_len-1))
# return_sequences=True : this specifies that the output of the LSTM layer should be a sequence.
model.add(LSTM(150,return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
# total_words : this specifies the number of classes for the output layer.
model.add(Dense(total_words,activation="softmax"))

# Compile the model
# model.compile(loss="categorical_crossentropy",optimizer='adam',metrics=['accuracy'])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])



In [17]:
print("Total words in vocabulary:", total_words)

Total words in vocabulary: 4818


In [18]:
X_train

array([[-0.01044608, -0.0171217 , -0.03001942, ..., -0.30158665,
         0.28111719, -0.46141797],
       [-0.01044608, -0.0171217 , -0.03001942, ..., -0.39419138,
         1.46821341, -0.20310179],
       [-0.01044608, -0.0171217 , -0.03001942, ..., -0.34617411,
         1.01775279, -0.46042825],
       ...,
       [-0.01044608, -0.0171217 , -0.03001942, ..., -0.39419138,
        -0.07925131, -0.41886013],
       [-0.01044608, -0.0171217 , -0.03001942, ..., -0.39419138,
        -0.36754611, -0.44459278],
       [-0.01044608, -0.0171217 , -0.03001942, ..., -0.36217987,
        -0.42796082, -0.47329458]])

In [19]:
print("Max value in X_train:", np.max(X_train))


Max value in X_train: 125.59270349345258


In [20]:

# # ensure the proper dtype for the data
# X_train = np.array(X_train).astype(np.float32)
# X_test = np.array(X_test).astype(np.float32)
# y_train = np.array(y_train).astype(np.float32)
# y_test = np.array(y_test).astype(np.float32)
# # priint the X_train highest and lowest values
# print("Max value in X_train:", np.max(X_train))
# print("Min value in X_train:", np.min(X_train))
# print("Max value in X_test:", np.max(X_test))
# print("Min value in X_test:", np.min(X_test))

In [24]:
from tensorflow.keras.layers import GRU

## GRU RNN
## Define the model
model=Sequential()
model.add(Embedding(total_words,100,input_length=max_sequence_len-1))
model.add(GRU(150,return_sequences=True))
model.add(Dropout(0.2))
model.add(GRU(100))
model.add(Dense(total_words,activation="softmax"))

# #Compile the model
model.compile(loss="categorical_crossentropy",optimizer='adam',metrics=['accuracy'])
model.summary()

In [21]:
y_train = np.argmax(y_train, axis=1)  # Convert one-hot to integer labels
y_test = np.argmax(y_test, axis=1)

In [23]:
history=model.fit(X_train,y_train,epochs=100,batch_size=64,validation_data=(X_test,y_test),callbacks=[early_stopping],verbose=1)

Epoch 1/100


OverflowError: Python int too large to convert to C long

In [25]:
# Function to predict the next word
def predict_next_word(model, tokenizer, text, max_sequence_len):
    token_list = tokenizer.texts_to_sequences([text])[0]
    if len(token_list) >= max_sequence_len:
        token_list = token_list[-(max_sequence_len-1):]  # Ensure the sequence length matches max_sequence_len-1
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict(token_list, verbose=0)
    predicted_word_index = np.argmax(predicted, axis=1)
    for word, index in tokenizer.word_index.items():
        if index == predicted_word_index:
            return word
    return None

In [None]:
input_text="To be or not to be"
print(f"Input text:{input_text}")
max_sequence_len=model.input_shape[1]+1
next_word=predict_next_word(model,tokenizer,input_text,max_sequence_len)
print(f"Next Word PRediction:{next_word}")

Input text:To be or not to be


AttributeError: Sequential model 'sequential_1' has no defined input shape yet.

In [29]:
## Save the model
model.save("next_word_lstm.h5")
## Save the tokenizer
import pickle
with open('tokenizer.pickle','wb') as handle:
    pickle.dump(tokenizer,handle,protocol=pickle.HIGHEST_PROTOCOL)



In [30]:
input_text="  Barn. Last night of all,When yond same"
print(f"Input text:{input_text}")
max_sequence_len=model.input_shape[1]+1
next_word=predict_next_word(model,tokenizer,input_text,max_sequence_len)
print(f"Next Word PRediction:{next_word}")

Input text:  Barn. Last night of all,When yond same


AttributeError: Sequential model 'sequential_1' has no defined input shape yet.