# Next word prediction

BY : ***Pradeep Verma***

Next word prediction is the task of predicting the most likely word or phrase to come next in a sentence or text.<br> It works like a built-in feature of an application that suggests the next word as you type or speak.<br> Next word prediction models are used in messaging apps, search engines, virtual assistants, and autocorrect features on smartphones.

## Next Word Prediction model

1) Start by collecting a diverse dataset of text documents, 
2) Preprocess the data by cleaning and tokenizing it, 
3) Prepare the data by creating input-output pairs, 
4) Engineer features such as word embeddings, 
5) Select an appropriate model like an LSTM or GPT, 
6) Train the model on the dataset while adjusting hyperparameters,
7) Improve the model by experimenting with different techniques and architectures.

## Importing the libraries and dataset

In [None]:
#import the libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [None]:
# Read the text file
with open('sherlock-holm.es_stories_plain-text_advs.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [None]:
print(len(text))

## Tokenize the text

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1

In the above code, the text is tokenized, which means it is divided into individual words or tokens.<br> A ‘Tokenizer’ object is created to handle the tokenization process. <br>Its ‘fit_on_texts’ method is called with ‘text’ as input. <br>This method analyzes the text and builds a vocabulary of unique words, assigning each word an integer index starting from 1, with more frequent words receiving lower indices.<br> The ‘total_words’ variable is set to the length of the word index plus one: the extra slot accounts for index 0, which is never assigned to a word and is used later for padding.
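
To make the indexing concrete, here is a rough plain-Python sketch of the kind of vocabulary that ‘fit_on_texts’ builds. It is not the Keras implementation: the helper name ‘build_word_index’ is made up for illustration, and it only lowercases and splits on whitespace, whereas the real ‘Tokenizer’ also strips punctuation by default.

```python
from collections import Counter

def build_word_index(text):
    """Map each distinct word to an integer index, most frequent word first."""
    # Simplification: lowercase and split on whitespace only (no punctuation filtering).
    words = text.lower().split()
    counts = Counter(words)
    # Indices start at 1; index 0 is left free for padding.
    # Words with equal counts keep their first-seen order in this sketch.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index("the dog saw the cat")
total_words = len(word_index) + 1  # +1 for the reserved index 0

print(word_index)   # {'the': 1, 'dog': 2, 'saw': 3, 'cat': 4}
print(total_words)  # 5
```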

## N-Grams

In [None]:
input_sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In the above code, the text data is split into lines using the ‘\n’ character as a delimiter. <br>For each line, the ‘texts_to_sequences’ method of the tokenizer converts the line into a sequence of integer tokens based on the previously created vocabulary. <br>The resulting token list is then iterated over using a for loop.<br>On each iteration, a subsequence of tokens is extracted, running from the beginning of the token list up to and including index ‘i’. The code calls these n-grams, although strictly they are prefixes of growing length rather than fixed-size windows.
<br>Each sequence holds the input context followed by the target word as its last token. <br>The sequence is then appended to the ‘input_sequences’ list. <br>Repeating this for every line in the text yields the input-output sequences used to train the next word prediction model. Lines with fewer than two tokens produce no sequences.
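
As a small illustration of the inner loop, the sketch below applies the same slicing to one toy token list. The function name ‘prefix_sequences’ and the token values are invented for this example.

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2, mirroring the inner loop above."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

# A line with four tokens yields three training sequences.
sequences = prefix_sequences([4, 7, 2, 9])
print(sequences)  # [[4, 7], [4, 7, 2], [4, 7, 2, 9]]

# A line with a single token yields nothing, since there is no context/target pair.
print(prefix_sequences([4]))  # []
```

In each sequence the last token is the word to predict and everything before it is the context.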

In [None]:
max_sequence_len = max([len(seq) for seq in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

In the above code, the input sequences are padded so that they all have the same length. <br>
The variable ‘max_sequence_len’ is assigned the maximum length among all the input sequences.  <br>
The ‘pad_sequences’ function then pads every shorter sequence with zeros up to this length; because ‘maxlen’ equals the longest sequence, nothing is truncated. <br>
The ‘padding='pre'’ argument places the zeros at the beginning of each sequence, which keeps the target word in the last position. <br>
The result is a two-dimensional NumPy array; ‘pad_sequences’ already returns one, so the surrounding ‘np.array’ call is redundant but harmless.
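
The effect of pre-padding can be sketched in plain Python. This is an approximation of what ‘pad_sequences’ does with ‘padding='pre'’, not its actual implementation; the helper name ‘pad_pre’ is hypothetical.

```python
def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros so that all have length maxlen."""
    padded = []
    for seq in sequences:
        # Keep only the last maxlen tokens if a sequence is too long
        # (this matches the Keras default of truncating from the front).
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
max_len = max(len(seq) for seq in toy_sequences)
padded = pad_pre(toy_sequences, max_len)
print(padded)  # [[0, 0, 4, 7], [0, 4, 7, 2], [4, 7, 2, 9]]
```

Note that the last column always holds a real token, which is what makes the later split into context and target work.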



In [None]:
X = input_sequences[:, :-1]

y = input_sequences[:, -1]

In the above code, the input sequences are split into two arrays, ‘X’ and ‘y’, to create the input and output for training the next word prediction model.<br>The ‘X’ array is assigned the values of all rows in the ‘input_sequences’ array except for the last column. It means that ‘X’ contains all the tokens in each sequence except for the last one, representing the input context.<br>On the other hand, the ‘y’ array is assigned the values of the last column in the ‘input_sequences’ array, which represents the target or predicted word.
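
The same split can be seen on a toy example. The sketch below uses plain lists instead of NumPy slicing, and the values are made up; ‘X_toy’ and ‘y_toy’ play the roles of ‘X’ and ‘y’.

```python
# Toy padded sequences; 0 is the padding index.
padded_toy = [
    [0, 0, 4, 7],
    [0, 4, 7, 2],
    [4, 7, 2, 9],
]

# Equivalent of input_sequences[:, :-1]: every column except the last.
X_toy = [row[:-1] for row in padded_toy]
# Equivalent of input_sequences[:, -1]: only the last column.
y_toy = [row[-1] for row in padded_toy]

print(X_toy)  # [[0, 0, 4], [0, 4, 7], [4, 7, 2]]
print(y_toy)  # [7, 2, 9]
```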

### One Hot Encoding

In [None]:
y = np.array(tf.keras.utils.to_categorical(y, num_classes=total_words))
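
The ‘to_categorical’ call turns each integer target into a vector of length ‘total_words’ that is 1 at the target index and 0 everywhere else. A minimal plain-Python sketch of that transformation, with a hypothetical helper name and a toy vocabulary of four indices, might look like this:

```python
def one_hot(indices, num_classes):
    """Convert integer class indices into one-hot rows."""
    rows = []
    for idx in indices:
        row = [0.0] * num_classes
        row[idx] = 1.0  # mark the position of the target word
        rows.append(row)
    return rows

encoded = one_hot([2, 0], num_classes=4)
print(encoded)  # [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```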

## Building the Neural Network

In [None]:
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(LSTM(150))
model.add(Dense(total_words, activation='softmax'))
model.summary()  # summary() prints the table itself; wrapping it in print() also prints "None"

The code above defines the architecture of the next word prediction model. <br>A ‘Sequential’ model is created, which represents a linear stack of layers. <br><br>The first layer added to the model is the ‘Embedding’ layer, which converts each integer token in the input sequence into a dense vector of fixed size.<br> It takes three arguments:
1) ‘total_words’, the size of the vocabulary plus one for the reserved padding index; 
2) ‘100’, the dimensionality of the word embeddings; 
3) ‘input_length’, the length of the input sequences, which is ‘max_sequence_len-1’ because the last token of each sequence was split off as the target.
<br>
The next layer added is the ‘LSTM’ layer, a type of recurrent neural network (RNN) layer designed for capturing sequential dependencies in the data. It has 150 units, which means its hidden state and cell state are each 150-dimensional vectors. With the default settings used here, the layer outputs only the final hidden state, a single vector summarizing the whole input sequence.
<br>
Finally, the ‘Dense’ layer is added, which is a fully connected layer that produces the output predictions. It has ‘total_words’ units and uses the ‘softmax’ activation function to convert the predicted scores into probabilities, indicating the likelihood of each word being the next one in the sequence.
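
To get a feel for the size of this model, the parameter counts reported by the model summary can be reproduced by hand. The sketch below assumes a standard LSTM layer with biases and a hypothetical vocabulary size of 10,000; the actual value of ‘total_words’ depends on the dataset.

```python
def count_parameters(vocab_size, embed_dim=100, lstm_units=150):
    """Count trainable parameters for an Embedding -> LSTM -> Dense stack."""
    # One embed_dim-sized vector per vocabulary index.
    embedding = vocab_size * embed_dim
    # Four gates, each with input weights, recurrent weights and a bias vector.
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    # Fully connected layer: weight matrix plus one bias per output word.
    dense = lstm_units * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "lstm": lstm,
        "dense": dense,
        "total": embedding + lstm + dense,
    }

params = count_parameters(vocab_size=10_000)  # hypothetical vocabulary size
print(params)
# {'embedding': 1000000, 'lstm': 150600, 'dense': 1510000, 'total': 2660600}
```

The LSTM count does not depend on the vocabulary size, while the embedding and output layers both grow linearly with it.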

## Compile and train the model


In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:
model.fit(X, y, epochs=100, verbose=1)

In the above code, the model is compiled and trained. The ‘compile’ method configures the model for training. The ‘loss’ parameter is set to ‘categorical_crossentropy’, the standard loss function for multi-class classification with one-hot encoded targets. The ‘optimizer’ parameter is set to ‘adam’, an optimization algorithm that adapts the step size of each parameter during training.


The ‘metrics’ parameter is set to ‘accuracy’ to monitor the accuracy during training. After compiling the model, the ‘fit’ method is called to train the model on the input sequences ‘X’ and the corresponding output ‘y’. The ‘epochs’ parameter specifies the number of times the training process will iterate over the entire dataset. The ‘verbose’ parameter is set to ‘1’ to display the training process.
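
For intuition about the loss, categorical cross-entropy for a single example reduces to the negative log of the probability the model assigned to the correct word. The following plain-Python sketch uses invented probabilities over a three-word vocabulary; it illustrates the formula and is not how Keras computes the loss internally.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    # With a one-hot target, only the true class contributes to the sum.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0.0, 1.0, 0.0]       # the correct word has index 1
confident = [0.1, 0.7, 0.2]    # model puts 70% on the correct word
unsure = [0.4, 0.2, 0.4]       # model puts only 20% on the correct word

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(round(loss_confident, 4))  # 0.3567
print(round(loss_unsure, 4))     # 1.6094
```

The loss shrinks toward zero as the probability of the correct word approaches 1, which is what training tries to achieve.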

In [None]:
seed_text = "I will leave if they"
next_words = 3

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
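    # Index of the most probable word for the single input sequence, as a plain int
    predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
    # index_word is the tokenizer's reverse mapping (index -> word);
    # index 0 is the padding slot and maps to no word
    output_word = tokenizer.index_word.get(predicted, "")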
    seed_text += " " + output_word

print(seed_text)
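
The loop above performs greedy decoding: at each step it takes the single most probable word and appends it to the seed text before predicting again. The sketch below isolates the pick-the-best-word step on a toy probability vector. The vocabulary, the probabilities, and the helper name ‘pick_word’ are invented for illustration.

```python
# Toy vocabulary; index 0 is reserved for padding and has no word.
word_index = {"the": 1, "door": 2, "opened": 3}
# Build the reverse mapping once instead of scanning word_index at every step.
index_word = {index: word for word, index in word_index.items()}

def pick_word(probabilities, index_word):
    """Return the word whose index has the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    # Fall back to an empty string if the best index has no word (e.g. padding).
    return index_word.get(best, "")

probs = [0.05, 0.15, 0.60, 0.20]  # one probability per vocabulary index
chosen = pick_word(probs, index_word)
print(chosen)  # door
```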