#### Project Description: Next Word Prediction Using LSTM

This project aims to develop a deep learning model for predicting the next word in a given sequence of words. The model is built using Long Short-Term Memory (LSTM) networks, which are well-suited for sequence prediction tasks. The project includes the following steps:

1- Data Collection: We use the text of Shakespeare's "Hamlet" as our dataset. This rich, complex text provides a good challenge for our model.

2- Data Preprocessing: The text data is tokenized, converted into sequences, and padded to ensure uniform input lengths. The sequences are then split into training and testing sets.

3- Model Building: An LSTM model is constructed with an embedding layer, two LSTM layers separated by a dropout layer, and a dense output layer whose softmax activation yields a probability for every word in the vocabulary.

4- Model Training: The model is trained using the prepared sequences, with early stopping implemented to prevent overfitting. Early stopping monitors the validation loss and stops training when the loss stops improving.

5- Model Evaluation: The model is evaluated using a set of example sentences to test its ability to predict the next word accurately.

6- Deployment: A Streamlit web application is developed to allow users to input a sequence of words and get the predicted next word in real time.

In [1]:
# Data collection
import nltk
from nltk.corpus import gutenberg

# Download the Gutenberg corpus (skipped if it is already present)
nltk.download('gutenberg')

# Load the dataset
data = gutenberg.raw('shakespeare-hamlet.txt')

# Save to file
with open('hamlet.txt', 'w', encoding='utf-8') as file:
    file.write(data)

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\indra\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [2]:
# Data preprocessing 
import numpy as np 
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Load the dataset and lowercase it
with open('hamlet.txt', 'r', encoding='utf-8') as file:
    text = file.read().lower()

# Tokenize the text: assign an integer index to every unique word
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
# tokenizer.word_index
# +1 because index 0 is reserved for padding
total_words = len(tokenizer.word_index) + 1
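
To make the `+ 1` in `total_words` concrete, here is a simplified pure-Python sketch of a frequency-ranked word index. `build_word_index` is a hypothetical helper written for illustration, not the Keras implementation. It only assumes that indices start at 1 in order of descending frequency and that index 0 is left free for padding.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to an integer rank, most frequent word first (rank 1)."""
    counts = Counter(text.lower().split())
    # most_common() sorts by descending count; index 0 is deliberately unused
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sample = "to be or not to be"
word_index = build_word_index(sample)
vocab_size = len(word_index) + 1  # +1 reserves index 0 for padding

print(word_index)
print(vocab_size)
```

With four distinct words the vocabulary size is 5, which is the size the embedding and output layers need so that every index from 0 to 4 is valid.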




In [3]:
# Create input sequences 
input_seq = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    # print(token_list)
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_seq.append(n_gram_sequence)
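
The loop above turns each line into a series of growing prefixes, where the last token of every prefix later becomes the label. A small sketch with made-up token ids shows the pattern; `ngram_prefixes` is a hypothetical name for the inner loop.

```python
def ngram_prefixes(token_list):
    """Return every prefix of length >= 2 of a token list."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

line_tokens = [4, 9, 2, 7]  # made-up token ids for a single line
for seq in ngram_prefixes(line_tokens):
    # everything but the last token is the input, the last token is the target
    print(seq[:-1], "->", seq[-1])
# [4] -> 9
# [4, 9] -> 2
# [4, 9, 2] -> 7
```

A line with a single token produces no training sequence, since there is no context to predict from.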

In [4]:
# Pad sequences 
max_sequence_len = max(len(x) for x in input_seq)
input_sequences = pad_sequences(input_seq, maxlen=max_sequence_len, padding='pre')
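
For intuition, pre-padding can be reproduced in a few lines of NumPy. This is a sketch of the behaviour relied on here (zeros added on the left, the most recent tokens kept if a sequence is too long), not the Keras source; `pre_pad` is a hypothetical name.

```python
import numpy as np

def pre_pad(sequences, maxlen):
    """Left-pad integer sequences with zeros to a common length."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # keep the most recent tokens if too long
        if trimmed:
            padded[row, maxlen - len(trimmed):] = trimmed
    return padded

toy = [[4, 9], [4, 9, 2], [4, 9, 2, 7]]
print(pre_pad(toy, maxlen=4))
# [[0 0 4 9]
#  [0 4 9 2]
#  [4 9 2 7]]
```

Padding on the left keeps the target word in the last column of every row, which is what makes the later `[:, :-1]` / `[:, -1]` split work.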

In [5]:
# Create predictors and label 
import tensorflow as tf 
# input_sequences[:, :-1] → takes all rows and all columns except the last
# input_sequences[:, -1] → takes all rows and only the last column 
x, y = input_sequences[:, :-1], input_sequences[:, -1]

# transforms each label into a vector of length = total_words 
# 1 is at the index of the label and all others are 0
y = tf.keras.utils.to_categorical(y, num_classes=total_words)
print(y)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
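
The printed matrix is almost entirely zeros because each row holds a single 1 at the index of its target word. The same split and one-hot encoding can be sketched on a toy array with NumPy alone; the array values and `num_classes = 10` are made up for illustration.

```python
import numpy as np

padded = np.array([[0, 0, 4, 9],
                   [0, 4, 9, 2],
                   [4, 9, 2, 7]])

# All columns but the last are the input; the last column is the label
x_toy, y_toy = padded[:, :-1], padded[:, -1]

num_classes = 10  # hypothetical vocabulary size for this toy example
# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(num_classes, dtype="float32")[y_toy]

print(x_toy.shape, y_toy, one_hot.shape)
print(one_hot.argmax(axis=1))  # recovers the original labels
```

With a vocabulary of several thousand words the one-hot matrix grows large. An alternative worth considering is keeping `y` as integers and compiling with `sparse_categorical_crossentropy`, which accepts integer labels directly.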


In [6]:
# Split the data into training and testing sets 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [7]:
# Build the LSTM model
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Define the model 
model = Sequential()
# Embedding layer: turns word indexes into dense vectors
# total_words is the vocabulary size; each word is mapped to a 100-dimensional vector
model.add(Embedding(total_words, 100, input_length = max_sequence_len-1))
# First LSTM layer, returns full sequence of outputs
model.add(LSTM(150, return_sequences = True))
# Randomly drop 20% of the activations during training to reduce overfitting
model.add(Dropout(0.2))
# Second LSTM layer: returns only the final hidden state, which is used for prediction
model.add(LSTM(100))
# Output layer - softmax turns output into probabilities
model.add(Dense(total_words, activation = 'softmax'))

# Compile the model 
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
model.summary()




Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 13, 100)           481800    
                                                                 
 lstm (LSTM)                 (None, 13, 150)           150600    
                                                                 
 dropout (Dropout)           (None, 13, 150)           0         
                                                                 
 lstm_1 (LSTM)               (None, 100)               100400    
                                                                 
 dense (Dense)               (None, 4818)              486618    
                                                                 
Total params: 1219418 (4.65 MB)
Trainable params: 1219418 (4.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
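
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the function names are illustrative only.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return (input_dim + 1) * units

vocab_size = 4818  # total_words, as shown in the dense layer's output shape
counts = [
    embedding_params(vocab_size, 100),  # embedding
    lstm_params(100, 150),              # lstm
    lstm_params(150, 100),              # lstm_1
    dense_params(100, vocab_size),      # dense
]
print(counts, sum(counts))
# [481800, 150600, 100400, 486618] 1219418
```

Roughly 80% of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size rather than with the LSTM widths.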


In [8]:
# Train the model
history = model.fit(x_train, y_train, epochs = 50, validation_data=(x_test, y_test), verbose = 1)

Epoch 1/50
...
Epoch 50/50
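
Step 4 of the description mentions early stopping on the validation loss, while the `model.fit` call in this cell passes no callback and runs all 50 epochs. In Keras this is usually done by passing `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=..., restore_best_weights=True)` through the `callbacks` argument of `fit`. The plain-Python sketch below isolates the patience rule; the function name and the loss values are made up for illustration and are not results from this notebook.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop condition
    return len(val_losses), best_epoch

losses = [6.9, 6.5, 6.4, 6.45, 6.6, 6.8, 7.1]  # made-up validation losses
print(early_stopping_epoch(losses, patience=3))
# (6, 3)
```

In this toy run the loss bottoms out at epoch 3 and training stops at epoch 6; restoring the best weights would roll the model back to its epoch 3 state.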


In [21]:
from helper import predict_next_word

In [23]:
input_text = "Is not this"
print(model.input_shape)
# The model input holds max_sequence_len - 1 tokens, so add 1 to recover the padded sequence length
max_sequence_length = model.input_shape[1] + 1
next_word = predict_next_word(model, tokenizer, input_text, max_sequence_length)
print(f"Next word prediction: {next_word}")

(None, 13)


NameError: name 'pad_sequences' is not defined
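
The `NameError` suggests that `helper.py` calls `pad_sequences` without importing it, in which case adding `from tensorflow.keras.preprocessing.sequence import pad_sequences` at the top of that file would be the likely fix. Since `helper.py` is not shown, the following is only a hypothetical sketch of what `predict_next_word` might look like. It pads with NumPy to stay self-contained and assumes the model and tokenizer expose `predict`, `texts_to_sequences`, and `word_index` as the Keras objects above do.

```python
import numpy as np

def predict_next_word(model, tokenizer, text, max_sequence_len):
    """Return the most probable next word for `text`, or None if unknown."""
    token_list = tokenizer.texts_to_sequences([text])[0]
    input_len = max_sequence_len - 1      # the model sees all but the label
    token_list = token_list[-input_len:]  # keep only the most recent tokens

    # Left-pad with zeros, mirroring padding='pre' used during training
    padded = np.zeros((1, input_len), dtype="int32")
    if token_list:
        padded[0, input_len - len(token_list):] = token_list

    probabilities = model.predict(padded, verbose=0)
    predicted_index = int(np.argmax(probabilities, axis=1)[0])

    # Reverse lookup: index -> word (index 0 is padding and has no word)
    index_word = {index: word for word, index in tokenizer.word_index.items()}
    return index_word.get(predicted_index)
```

Called as in the cell above, with `max_sequence_length = model.input_shape[1] + 1`, the function feeds the model a `(1, 13)` array, which matches the printed input shape.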

In [20]:
# Save the model and the fitted tokenizer
import pickle

model.save('next_word_lstm_model.h5')
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)