<a href="https://colab.research.google.com/github/gowtamyreddy/NLP/blob/main/RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using RNN to predict the next word in the sentence/para ,once we give the input**

In [2]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN,Embedding,Dense
import numpy as np



In [4]:
#Load the data and preprocessing the datta
def load_data(file_path):
  with open(file_path,'r', encoding='utf-8') as f:
    text=f.read()
  return text

file_path = '/content/sample_data/01 Harry Potter and the Sorcerers Stone.txt'
text=load_data(file_path).lower() #converting the text to lowercase
print(text[:1500])


m r. and mrs. dursley, of number four, privet drive, were proud to say that they were perfectly normal, thank you very much. they were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.

mr. dursley was the director of a firm called grunnings, which made drills. he was a big, beefy man with hardly any neck, although he did have a very large mustache. mrs. dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. the dursleys had a small son called dudley and in their opinion there was no finer boy anywhere.

the dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. they didn’t think they could bear it if anyone found out about the potters. mrs. potter was mrs. dursley’s sister, but they hadn’t met for several ye

In [5]:
#Tokenization
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences #Padding

# Out-Of-Vocabulary token
# If a word not seen during training appears later, it will be replaced with <OOV>
# Helps handle unknown words instead of ignoring them
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts([text]) #Mapping of words to unique integers
total_words = len(tokenizer.word_index)+1 #Total number of unique words

#convert text to Sequences
input_sequences=[]
tokens = tokenizer.texts_to_sequences([text])[0] #Converts input text into a list of number based on the word index
seq_len = 100 #Each input contains 100 words

# First seq_length tokens (input): Used for training the model.
# Last token (target): Used as the label the model tries to predict.
# so total of (50 + 1) in one input_sequence index

for i in range(seq_len,len(tokens)):
  input_sequences.append(tokens[i-seq_len:i+1])

#Padding sequences and split inputs/targets
#x will have inputs y will have outputs

input_sequences = np.array(pad_sequences(input_sequences,maxlen=seq_len +1, padding = 'pre'))
x,y = input_sequences[:,:-1],input_sequences[:,-1]
print(x[0])
print(y[0])
print(x.shape)
print(y.shape)





[2162 3680    4  274  224    8  651  332  652  535   35 1268    5  164
   20   21   35 1586  973 1587   14   69  157   21   35    2  141  128
  653  789    5   32 1588   12  169  490  110 1416  142   21   68   55
  909   25  505 1788  151  224   10    2 2701    8    6 2702  275 2703
  140  183 1417    7   10    6  394 3681  333   25  491  191  593  974
    7  131   36    6   69  233 1418  274  224   10  975    4 2704    4
   17  343  689    2  594 3682    8  593  140  159   12   69 1789   22
   46  910]
53
(80922, 100)
(80922,)


In [6]:
#one hot encode the labels
y = tf.keras.utils.to_categorical(y, num_classes=total_words)
#Build the simple RNN model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=total_words, output_dim=50, input_length=seq_len),
    SimpleRNN(300, return_sequences=True),
    # 2300 in RNN - The number of hidden units (size of the hidden state vector)
    SimpleRNN(300),
    Dense(total_words, activation='softmax')
 ])




In [7]:
#compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#train the model
model.fit(x, y, epochs=25,batch_size=128)

Epoch 1/25
[1m633/633[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m397s[0m 621ms/step - accuracy: 0.0404 - loss: 7.1334
Epoch 2/25
[1m633/633[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m443s[0m 622ms/step - accuracy: 0.0425 - loss: 6.8732
Epoch 3/25
[1m633/633[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m389s[0m 614ms/step - accuracy: 0.0412 - loss: 6.9760
Epoch 4/25
[1m633/633[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m447s[0m 623ms/step - accuracy: 0.0475 - loss: 6.6510
Epoch 5/25
[1m633/633[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m443s[0m 625ms/step - accuracy: 0.0624 - loss: 6.3398
Epoch 6/25
[1m633/633[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m390s[0m 616ms/step - accuracy: 0.0858 - loss: 6.0917
Epoch 7/25
[1m633/633[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m388s[0m 613ms/step - accuracy: 0.1000 - loss: 5.8589
Epoch 8/25
[1m633/633[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m443s[0m 614ms/step - accuracy: 0.1171 - loss: 5.5350
Epoch 9/

<keras.src.callbacks.history.History at 0x7e0f102cbe90>

In [8]:
# Function to generate text using RNN
def generate_text(seed_text, next_words=50, seq_length=50): # Added seq_length with default value 100
    for _ in range(next_words):
        tokenized_input = tokenizer.texts_to_sequences([seed_text])[0]
        tokenized_input = pad_sequences([tokenized_input], maxlen=seq_length, padding='pre') #seq_length is now in scope

        predicted_probs = model.predict(tokenized_input, verbose=0)
        predicted_index = np.argmax(predicted_probs)
        predicted_word = tokenizer.index_word.get(predicted_index, "<OOV>")

        seed_text += " " + predicted_word
    return seed_text

# Generate text using the trained model
print(generate_text("harry at hogwarts")) #all with the default seq_length

harry at hogwarts his hands had broken a sound of people whooshing and the movements of magic already and what he had been forgotten to see the damage is — but it’s incredible he was fine of the first years in the world to get past fluffy but he didn’t have to be
