# 2. Ponniyin Selvan RNN Chatbot with pre-trained CBoW embeddings

This notebook, the second in this week's assignment, aims to build our first model. Here, we will train the model using pre-trained embeddings from the historic Tamil text Ponniyin Selvan. This setup will later allow us to compare it to a second model, where our RNN will learn the embeddings on its own, which we will construct in the third notebook.

## Step 1: Import Libraries

In [1]:
import csv
import itertools
import operator
import numpy as np
import nltk
from datetime import datetime
import matplotlib.pyplot as plt
from nltk import tokenize
from collections import Counter
import re
import random
import pickle

# For Keras model
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

from indicnlp.tokenize.sentence_tokenize import sentence_split
from indicnlp.tokenize import indic_tokenize

# Download NLTK data if needed
nltk.download("punkt")

Using TensorFlow backend.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lokes\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Step 2: Data Cleaning and Preparation

We are going to set up tokens first:

- unknown_token: This token will represent words that aren’t in our vocabulary. It helps the model manage unfamiliar words during training and generation.

- sentence_start_token: This token will be added to the beginning of every sentence, so the model understands where sentences start.

- sentence_end_token: This token will be placed at the end of each sentence, allowing the model to know when the sentence is complete.

These tokens will help the model structure sentences and deal with unknown words effectively during training and text generation.

In [2]:
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

We want to strip any numbers and Roman numerals that appear in our dataset. Note that clean_numbers() also removes hyphens.

In [3]:
def clean_numbers(text):
    pattern = r"[\d-]"
    return re.sub(pattern, '', text)

def clean_roman_numerals(text):
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '', text)
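
As a quick sanity check, the two helpers can be tried on a short string. This is only a sketch: the sample string is made up for illustration and is not taken from the corpus, and the function bodies are copied from the cell above. The helpers are applied in the same order as in the preprocessing cell (numbers first, then Roman numerals).

```python
import re

def clean_numbers(text):
    # Strip every digit and every hyphen
    return re.sub(r"[\d-]", '', text)

def clean_roman_numerals(text):
    # Strip standalone Roman numerals, optionally followed by a period
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '', text)

# Hypothetical sample line, not from the corpus
sample = "XIV. அத்தியாயம் 12 - ஆடித்திருநாள்"
cleaned = clean_roman_numerals(clean_numbers(sample))
print(repr(cleaned))
```

The removed characters leave runs of spaces behind. In the notebook, whitespace is collapsed before these helpers run, which is why some extra spaces remain visible in the cleaned sentences.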

Next, we are going to read and clean the text of Ponniyin Selvan. The goal is to prepare the text so it can be fed into our model without stray punctuation, numbering, or formatting issues. Here's how we'll do that:

- <b>Reading the file:</b> We’ll open the text file (ponniyin-selvan.txt) and read its content into memory as UTF-8.

- <b>Replacing and removing punctuation:</b>
        Commas, colons, and semicolons are replaced with a `_comma_` placeholder token (a comma at the end of a line becomes `_eol_`).
        Single and double quotes, tabs, and zero-width non-joiner characters are removed.

- <b>Handling sentence-ending punctuation:</b>
        We’ll replace ? and ! with periods (.) to normalize sentence ends, which gives consistent sentence boundaries during training.

- <b>Removing extra whitespace:</b>
        We’ll collapse line breaks and repeated spaces into single spaces so the text is uniformly formatted before being tokenized.

- <b>Removing numbers and Roman numerals:</b>
        Digits, hyphens, and Roman numerals (for example in chapter numbering) are stripped with clean_numbers() and clean_roman_numerals().

- <b>Splitting into sentences and tokens:</b>
        The Indic NLP library splits the text into Tamil sentences (sentence_split) and then into tokens (trivial_tokenize). Sentences with two or fewer words are dropped, and only the first 8,000 sentences are kept. The call to lower() has no effect on Tamil script, which has no letter case; it only matters for any Latin characters left in the text.

In [4]:
print("Reading txt file...")
with open(r'ponniyin-selvan.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Preprocessing: Replacing punctuation and cleaning
text = text.replace(",\n", " _eol_ ")
text = text.replace(",", " _comma_ ")
text = text.replace(":", " _comma_ ")
text = text.replace(";", " _comma_ ")
text = text.replace("?\n", ". ")
text = text.replace("!\n", ". ")
text = text.replace(".\n", ". ")
text = text.replace('"', "")  # Remove double quotes
text = text.replace("'", "")  # Remove single quotes
text = text.replace("?", ".")
text = text.replace("!", ".")
text = text.replace("\t", "")
text = text.replace("\u200c", "")  # Remove zero-width non-joiner
text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces

# Additional cleaning
text = clean_numbers(text)
text = clean_roman_numerals(text)

# Sentence splitting using indic-nlp-library for Tamil
sentences = sentence_split(text, lang='ta')  # Tamil language code

# Lowercase and tokenize the sentences
sentences = [s.lower().strip() for s in sentences if len(s.split()) > 2]
tokenized_sentences = [indic_tokenize.trivial_tokenize(s, lang='ta') for s in sentences]

# Now, limit the corpus to the first 8,000 sentences
num_sentences_to_use = 8000
tokenized_sentences = tokenized_sentences[:num_sentences_to_use]

print(f"Total number of sentences: {len(tokenized_sentences)}")

print("Preprocessing done!")

Reading txt file...
Total number of sentences: 8000
Preprocessing done!


In [5]:
sentences[0:10]

['பொன்னியின் செல்வன் வரலாற்றுப் புதினம் அமரர் கல்கி கிருஷ்ணமூர்த்தி அத்தியாயம்   ஆடித்திருநாள் ஆதி அந்தமில்லாத கால வெள்ளத்தில் கற்பனை ஓடத்தில் ஏறி நம்முடன் சிறிது நேரம் பிரயாணம் செய்யுமாறு நேயர்களை அழைக்கிறோம்.',
 'விநாடிக்கு ஒரு நூற்றாண்டூ வீதம் எளிதில் கடந்து இன்றைக்குத் தொள்ளாயிரத்து எண்பத்திரண்டூ (ல் எழுதியது) ஆண்டூகளுக்கு முந்திய காலத்துக்குச் செல்வோமாக.',
 'தொண்டை நாட்டுக்கும் சோழ நாட்டுக்கும் இடையில் உள்ள திருமுனைப்பாடி நாட்டின் தென்பகுதியில் _comma_ தில்லைச் சிற்றம்பலத்துக்கு மேற்கே இரண்டூ காததூரத்தில் _comma_ அலை கடல் போன்ற ஓர் ஏரி விரிந்து பரந்து கிடக்கிறது.',
 'அதற்கு வீரநாராயண ஏரி என்று பெயர்.',
 'அது தெற்கு வடக்கில் ஒன்றரைக் காத நீளமும் கிழக்கு மேற்கில் அரைக் காத அகலமும் உள்ளது.',
 'காலப்போக்கில் அதன் பெயர் சிதைந்து இந்நாளில் வீராணத்து ஏரி என்ற பெயரால் வழங்கி வருகிறது.',
 'புது வெள்ளம் வந்து பாய்ந்து ஏரியில் நீர் நிரம்பித் ததும்பி நிற்கும் ஆடி ஆவணி மாதங்களில் வீரநாராயண ஏரியைப் பார்ப்பவர் எவரும் நம்முடைய பழந்தமிழ் நாட்டு முன்னோர்கள் தங்கள் காலத்தில் சாதித்த அரும்பெரும் காரி

We are going to add start and end tokens to the tokenized text:

- <b>Tokenized sentences</b>:
        The text was already split into sentences with sentence_split() and tokenized with indic_tokenize.trivial_tokenize() in the preprocessing cell, so we reuse tokenized_sentences here.

- <b>Add start and end tokens:</b>
        For each sentence, we’ll add SENTENCE_START at the beginning and SENTENCE_END at the end to help the model understand sentence boundaries.

- <b>Example output:</b>
        The first 10 cleaned sentences are printed above to verify that everything is working as expected.


Now we are going to count the words and clean the tokenized sentences:

- <b>Count word frequencies:</b>
        Using Counter, we’ll count how often each word appears in the text and print the total number of unique word tokens.
- <b>Remove unwanted punctuation:</b>
        Standalone period tokens are removed from the tokenized sentences in a later cell (In [8]), after the vocabulary has been built. The SENTENCE_START and SENTENCE_END tokens stay in place, so sentence boundaries are not lost.

In [6]:
# 2. Tokenize and build vocabulary
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Add SENTENCE_START and SENTENCE_END tokens
tokenized_sentences = [[sentence_start_token] + sentence + [sentence_end_token] for sentence in tokenized_sentences]
# Flatten tokenized sentences to get all words
all_words = [word for sentence in tokenized_sentences for word in sentence]

# Count word frequencies
word_freq = Counter(all_words)
print(f"Found {len(word_freq)} unique word tokens.")

Found 21129 unique word tokens.


In [7]:
# Limit the vocabulary to cover 95% of the text
sorted_word_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
total_word_count = sum(word_freq.values())
coverage = 0
vocab_size = 0
desired_coverage = 0.95
for word, count in word_freq.most_common():
    coverage += count / total_word_count
    vocab_size += 1
    if coverage >= desired_coverage:
        break

print(f"Selected vocabulary size: {vocab_size} with {desired_coverage * 100}% coverage")

Selected vocabulary size: 16137 with 95.0% coverage
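
To make the coverage rule concrete, here is a minimal sketch on a toy token list. The helper name `vocab_size_for_coverage` and the toy data are illustrative assumptions, not part of the notebook. It also accumulates integer counts rather than summing fractions as the cell above does, which gives the same result up to floating-point rounding.

```python
from collections import Counter

def vocab_size_for_coverage(tokens, desired_coverage):
    """Return how many of the most frequent words are needed to cover
    the requested fraction of all token occurrences."""
    word_freq = Counter(tokens)
    total = sum(word_freq.values())
    covered = 0
    for size, (word, count) in enumerate(word_freq.most_common(), start=1):
        covered += count
        if covered / total >= desired_coverage:
            return size
    return len(word_freq)

# Toy corpus: 'a' appears 6 times, 'b' 3 times, 'c' once
toy_tokens = ["a"] * 6 + ["b"] * 3 + ["c"]
print(vocab_size_for_coverage(toy_tokens, 0.8))
```

With a long-tailed vocabulary such as the one here (21,129 unique tokens), most of the words added near the 95% mark occur only once or twice.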


1. We sort the words by frequency and select the most common ones to cover 95% of the total word occurrences, determining the vocabulary size.  
2. Mappings (`index_to_word` and `word_to_index`) are created for the vocabulary, including an `unknown_token` for out-of-vocabulary words.  
3. Rare words in the tokenized sentences are replaced with `UNKNOWN_TOKEN` to ensure consistency during training.  
4. An example sentence is shown with rare words replaced to verify how sentences look after processing.

In [8]:
# Create mappings from word to index and index to word
vocab = word_freq.most_common(vocab_size - 1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = {word: i for i, word in enumerate(index_to_word)}

# Replace words not in the vocabulary with UNKNOWN_TOKEN
tokenized_sentences = [[word if word in word_to_index else unknown_token for word in sentence] for sentence in tokenized_sentences]
tokenized_sentences = [
    [word for word in sentence if word != '.']
    for sentence in tokenized_sentences
]
# Show an example sentence after rare word handling
print(f"Example sentence after replacing rare words: {tokenized_sentences[0]}")

Example sentence after replacing rare words: ['SENTENCE_START', 'பொன்னியின்', 'செல்வன்', 'வரலாற்றுப்', 'புதினம்', 'அமரர்', 'கல்கி', 'கிருஷ்ணமூர்த்தி', 'அத்தியாயம்', 'ஆடித்திருநாள்', 'ஆதி', 'அந்தமில்லாத', 'கால', 'வெள்ளத்தில்', 'கற்பனை', 'ஓடத்தில்', 'ஏறி', 'நம்முடன்', 'சிறிது', 'நேரம்', 'பிரயாணம்', 'செய்யுமாறு', 'நேயர்களை', 'அழைக்கிறோம்', 'SENTENCE_END']


In [9]:
import random

# Select a random sentence from tokenized_sentences
random_sentence = random.choice(tokenized_sentences)

# Convert the sentence to indices using word_to_index
sentence_indices = [word_to_index[word] for word in random_sentence]

# Convert the indices back to words using index_to_word
sentence_words = [index_to_word[index] for index in sentence_indices]

# Print the results
print("Random sentence:", random_sentence)
print("Sentence as indices:", sentence_indices)
print("Sentence from indices:", sentence_words)

Random sentence: ['SENTENCE_START', 'இப்படியாவது', 'இராஜ்யம்', 'சம்பாதிக்க', 'வேண்டுமா', 'SENTENCE_END']
Sentence as indices: [1, 13210, 2748, 13211, 2082, 2]
Sentence from indices: ['SENTENCE_START', 'இப்படியாவது', 'இராஜ்யம்', 'சம்பாதிக்க', 'வேண்டுமா', 'SENTENCE_END']


## Step 3: Generating N-Grams

Range of n-grams: We are harvesting all n-grams from bigrams (2-grams) to 20-grams by iterating through lengths from 2 to 20.

Count n-grams: For each length i, the function ngrams() is used to generate all possible n-grams of that length from the text. The Counter is then used to count the occurrences of each n-gram.

Store n-grams: The counts for each n-gram length are printed and stored in the ngrams_up_to_20 list.

In [10]:
from nltk.util import ngrams
from collections import Counter

# Harvesting all n-grams up to length 20
ngrams_up_to_20 = []
for i in range(2, 21):
    ngram_counts = Counter(ngrams(text.split(), i))  # Collecting n-grams of length i
    print(f'ngram-{i} length:', len(ngram_counts))
    ngrams_up_to_20.append(ngram_counts)

ngram-2 length: 326762
ngram-3 length: 416139
ngram-4 length: 427866
ngram-5 length: 429198
ngram-6 length: 429397
ngram-7 length: 429440
ngram-8 length: 429454
ngram-9 length: 429456
ngram-10 length: 429457
ngram-11 length: 429457
ngram-12 length: 429456
ngram-13 length: 429455
ngram-14 length: 429454
ngram-15 length: 429453
ngram-16 length: 429452
ngram-17 length: 429451
ngram-18 length: 429450
ngram-19 length: 429449
ngram-20 length: 429448
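
A sequence of N tokens contains N − n + 1 n-grams of length n, so once nearly every n-gram is unique the count of distinct n-grams should shrink by one for each extra word. That matches the pattern in the output above from n = 11 onward. The sketch below illustrates this on a toy sequence; `simple_ngrams` is a plain-Python stand-in for `nltk.util.ngrams`, written here only so the example has no dependencies.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Yield consecutive n-token tuples (stand-in for nltk.util.ngrams)."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "a b a b c .".split()
for n in range(2, 5):
    counts = Counter(simple_ngrams(tokens, n))
    # n, total number of n-grams, number of distinct n-grams
    print(n, len(tokens) - n + 1, len(counts))
```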


We need to ensure that the n-grams we keep are complete and not broken by sentence-ending punctuation. So, we will implement helper functions to ensure that. They are:
- remove_periods(): This function checks whether any word in the n-gram contains a period or a curly quotation mark. If any such character is found, the function returns False, indicating that the n-gram should be excluded.

- my_filter(): This function applies remove_periods() to a list of (n-gram, count) pairs, filtering out any n-grams that span sentence boundaries. It does not filter on the count, so n-grams that occur only once are kept.

In [11]:
# Predicate that rejects n-grams containing periods or curly quotes
def remove_periods(ngram):
    """Return True if no word in the (n-gram, count) pair contains a period or curly quote."""
    # ngram[0] is the tuple of words; ngram[1] is its count
    for word in ngram[0]:
        if '.' in word or '’' in word or '‘' in word:
            return False
    return True

# Keep only n-grams that stay within a sentence
def my_filter(ngrams):
    """Filter (n-gram, count) pairs, keeping those that do not span sentence boundaries."""
    return filter(remove_periods, ngrams)
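
The predicate receives the (n-gram, count) pairs that `Counter.most_common()` produces, which is why it inspects `ngram[0]`. The sketch below runs it on a few hand-made counts; the n-grams and their counts are invented for illustration.

```python
from collections import Counter

def remove_periods(ngram):
    # ngram is a (words, count) pair as produced by Counter.most_common()
    for word in ngram[0]:
        if '.' in word or '’' in word or '‘' in word:
            return False
    return True

# Hypothetical bigram counts
counts = Counter({
    ("அதற்கு", "வீரநாராயண"): 3,
    ("பெயர்.", "அது"): 2,      # crosses a sentence boundary
    ("ஏரி", "என்று"): 1,        # occurs once, still kept
})
kept = [words for words, count in filter(remove_periods, counts.most_common())]
print(kept)
```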

## Step 4: Creating the Final Dataset

Now, we'll construct the training dataset (X_train and y_train) using n-grams from 2-grams to 20-grams:

- Initialize training data: Empty lists X_train and y_train are created to store the input sequences and target words.

- Process n-grams: For each set of n-grams (from bigrams to 20-grams), we iterate through the most common n-grams.

- Filter valid n-grams: Using my_filter(), we drop n-grams that contain a period or curly quote, and of the remaining ones only those where all words are in the vocabulary (word_to_index) are considered.

- Create training examples:
        X_train: The input sequence consists of the n-gram minus the last word.
        y_train: The target is the last word of the n-gram.

- Final count: The total number of sequences generated from n-grams is printed, showing how many training examples were created.

In [12]:
# Initialize training data lists
X_train = []
y_train = []

# Process all n-grams from 2 to 20
for i in range(len(ngrams_up_to_20)):  # Starting from bigrams
    ngrams_to_learn = ngrams_up_to_20[i]

    # Construct X_train and y_train using the filtered n-grams
    for sent in my_filter(ngrams_to_learn.most_common()):
        ngram = sent[0]
        # Ensure all words are in vocabulary
        if all(word in word_to_index for word in ngram):
            ngram_indices = [word_to_index[word] for word in ngram]
            # Input sequence is the n-gram minus the last word
            X_train.append(ngram_indices[:-1])
            # Target is the last word of the n-gram
            y_train.append(ngram_indices[-1])

print(f'Total sequences from n-grams: {len(X_train)}')

Total sequences from n-grams: 340256


We now expand the training dataset by incorporating sequences from complete tokenized sentences. For each sentence, we ensure that all words are in the vocabulary, and then we create input sequences using the words leading up to the current word, with the current word as the target output. This allows the model to learn from entire sentence structures, improving its ability to predict the next word based on the broader context of the sentence.


In [13]:
# Include sequences from your tokenized sentences
for sentence in tokenized_sentences:
    if all(word in word_to_index for word in sentence):
        sentence_indices = [word_to_index[word] for word in sentence]
        for i in range(1, len(sentence_indices)):
            X_train.append(sentence_indices[:i])
            y_train.append(sentence_indices[i])

print(f'Total sequences after including sentences: {len(X_train)}')

Total sequences after including sentences: 421996
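
To illustrate the expansion, the sketch below turns one sentence into its (prefix, next word) pairs. The helper name `prefix_target_pairs` is an assumption introduced for this example; the indices are the ones printed for the random sentence earlier in this notebook.

```python
def prefix_target_pairs(indices):
    """Expand one sentence (as word indices) into (prefix, next-word) pairs."""
    return [(indices[:i], indices[i]) for i in range(1, len(indices))]

# SENTENCE_START ... SENTENCE_END, as indices
sentence_indices = [1, 13210, 2748, 13211, 2082, 2]
pairs = prefix_target_pairs(sentence_indices)
for prefix, target in pairs:
    print(prefix, "->", target)
```

A sentence of k tokens therefore contributes k − 1 training examples, the last of which teaches the model when to emit SENTENCE_END.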


We combine X_train (input sequences) and y_train (target words) into a single list of tuples, then shuffle them together using random.shuffle().
After shuffling, we unpack the combined list back into X_train and y_train, keeping the pairs aligned.

In [14]:
# Shuffle the data
combined = list(zip(X_train, y_train))
random.shuffle(combined)
X_train[:], y_train[:] = zip(*combined)

We now prepare the training data by padding the sequences and converting them into the appropriate format for model training:

- Determine maximum sequence length: We calculate the length of the longest sequence in X_train to use this as the standard for padding all sequences to the same length.

- Pad sequences: Using pad_sequences(), we pad the input sequences (X_train) with zeros at the beginning (pre-padding). This ensures that all sequences have the same length, which is necessary for training models that expect fixed-length input.

- Convert y_train to a NumPy array: We convert y_train to a NumPy array, making it compatible with machine learning libraries that require data in this format.

- Print shape of padded data: We print the shape of the padded X_train and y_train to verify that they are ready for model training.

In [15]:
from keras.preprocessing.sequence import pad_sequences

# Determine the maximum sequence length
max_seq_length = max(len(seq) for seq in X_train)
print(f'Max sequence length: {max_seq_length}')

# Pad sequences with zeros at the beginning
X_train_padded = pad_sequences(X_train, maxlen=max_seq_length, padding='pre')

# Convert y_train to a NumPy array
y_train = np.array(y_train)

print(f'X_train_padded shape: {X_train_padded.shape}, y_train shape: {y_train.shape}')

Max sequence length: 75
X_train_padded shape: (421996, 75), y_train shape: (421996,)
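
The effect of pre-padding can be sketched in plain Python. `pre_pad` below is a hypothetical helper meant to mimic `pad_sequences(..., padding='pre')` with its default pre-truncation; it is not the Keras implementation, and the input indices are made up.

```python
def pre_pad(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pre_pad([[1, 57], [1, 57, 903, 2]], maxlen=5))
```

One point worth keeping in mind: word_to_index is built with enumerate() starting at 0, so the padding value 0 is also the index of the most frequent token, and the model in this notebook does not mask padding.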


We are splitting the dataset into training and validation sets. Using train_test_split(), 90% of the data is allocated for training, while 10% is reserved for validation. The validation set helps us assess the model's performance on unseen data, ensuring it generalizes well beyond the training set. The split is made reproducible by setting a random seed (random_state=42).

In [16]:
from sklearn.model_selection import train_test_split

# Splitting the data
X_train_padded_train, X_train_padded_val, y_train_train, y_train_val = train_test_split(
    X_train_padded, y_train, test_size=0.1, random_state=42
)

print(f'Training data shape: {X_train_padded_train.shape}, {y_train_train.shape}')
print(f'Validation data shape: {X_train_padded_val.shape}, {y_train_val.shape}')

Training data shape: (379796, 75), (379796,)
Validation data shape: (42200, 75), (42200,)


## Step 5: Saving all the Training Data

Since this is a model that takes extensive tuning and training, we are saving the processed training data and vocabulary to disk using pickle.

The main idea behind pickling is to avoid having to redo the entire preprocessing each time we want to use the data. By saving X_train, y_train, the tokenized sentences, and the word-to-index and index-to-word mappings, we can easily reload them later. 

This makes sure that we always have access to all the necessary data for training, predictions and further analysis, as it can be loaded quickly without re-running the entire data preparation pipeline.

In [17]:
import pickle

# Save the processed data
with open('pickle/X_train_padded_ps.pkl', 'wb') as file:
    pickle.dump(X_train_padded, file)
with open('pickle/y_train_ps.pkl', 'wb') as file:
    pickle.dump(y_train, file)
with open('pickle/tokenized_sentences_ps.pkl', 'wb') as file:
    pickle.dump(tokenized_sentences, file)
with open('pickle/word_to_index_ps.pkl', 'wb') as file:
    pickle.dump(word_to_index, file)
with open('pickle/index_to_word_ps.pkl', 'wb') as file:
    pickle.dump(index_to_word, file)

print("Data saved successfully!")

Data saved successfully!
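
Reloading works the same way in reverse. The sketch below shows a save and load round trip using a temporary directory and toy data, so it does not depend on the files written above; `save_pickle` and `load_pickle` are helper names introduced only for this example.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    with open(path, 'wb') as file:
        pickle.dump(obj, file)

def load_pickle(path):
    with open(path, 'rb') as file:
        return pickle.load(file)

# Round trip in a throwaway directory with a toy mapping
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'word_to_index_demo.pkl')
    toy_word_to_index = {'SENTENCE_START': 1, 'SENTENCE_END': 2}
    save_pickle(toy_word_to_index, path)
    restored = load_pickle(path)

print(restored == toy_word_to_index)
```

Note that open() does not create missing directories, so the `pickle/` folder used above has to exist before the save cell runs, for example via `os.makedirs('pickle', exist_ok=True)`.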


In [18]:
# # Load CBOW embeddings from your saved file
# embeddings_index = {}  # Initialize dictionary

# # Open your CBOW embedding file
# with open('my_cbow_vectors_ps.txt', 'r', encoding='utf-8') as f:
#     # Read the header
#     vocab_size, embedding_dim = map(int, f.readline().split())
    
#     for line in f:
#         values = line.split()
#         word = values[0]  # The word itself
#         coefs = np.asarray(values[1:], dtype='float32')  # The word vector
#         embeddings_index[word] = coefs

# print('Found %s word vectors.' % len(embeddings_index))

# # Create the embedding matrix for the CBOW embeddings
# embedding_matrix = np.zeros((vocab_size, embedding_dim))
# for word, i in word_to_index.items():
#     if i >= vocab_size:
#         continue
#     embedding_vector = embeddings_index.get(word)
#     if embedding_vector is not None:
#         embedding_matrix[i] = embedding_vector

# print(f'Embedding matrix shape: {embedding_matrix.shape}')

## Step 6: Loading our pre-trained CBoW Embeddings

Now, we load the pre-trained word embeddings that we created in the previous notebook with CBoW on Ponniyin Selvan into a dictionary called embeddings_index.
- Each word from the CBoW file is mapped to its corresponding vector representation, which captures its semantic meaning in a numerical form. 
- By storing these embeddings, we can later use them to initialize the word representations in our model, allowing it to leverage the rich, pre-learned relationships between words, which can improve the model's performance, especially when dealing with limited training data.

In [19]:
# Load CBoW embeddings
embeddings_index = {}  # Initialize dictionary

# Open the CBoW embeddings file
with open('my_cbow_vectors_ps.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # The word itself
        coefs = np.asarray(values[1:], dtype='float32')  # The word vector
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# Create the embedding matrix
embedding_dim = 100
embedding_matrix = np.zeros((vocab_size, embedding_dim))
print(vocab_size)
for word, i in word_to_index.items():
    if i >= vocab_size:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print(f'Embedding matrix shape: {embedding_matrix.shape}')

Found 69351 word vectors.
16137
Embedding matrix shape: (16137, 100)
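
The commented-out cell above reads a header line (vocabulary size and dimension) before the vectors, while the active cell parses every line as a word vector. Word2vec-style text files typically start with such a header; whether my_cbow_vectors_ps.txt does is an assumption here. A minimal, dependency-free sketch of a reader that skips the header when present could look like this (`read_vectors` and the demo data are hypothetical, and the header check is a simple heuristic):

```python
import io

def read_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict, skipping a leading
    'count dimension' header line if one is present."""
    vectors = {}
    for line_number, line in enumerate(lines):
        values = line.split()
        if not values:
            continue
        # Heuristic: a first line made of exactly two integers is a header
        if line_number == 0 and len(values) == 2 and all(v.isdigit() for v in values):
            continue
        vectors[values[0]] = [float(v) for v in values[1:]]
    return vectors

# Tiny in-memory stand-in for an embeddings file with a header
demo_file = io.StringIO("2 3\nசோழன் 0.1 0.2 0.3\nஏரி 0.4 0.5 0.6\n")
vectors = read_vectors(demo_file)
print(len(vectors), vectors["ஏரி"])
```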


## Step 7: Building our Recurrent Neural Network Model

We define an RNN model using Keras with two LSTM layers for text sequence prediction:
- Input Layer: Takes sequences of predefined maximum length.
- Embedding Layer: Uses pre-trained CBoW embeddings, which are non-trainable, for initializing word vectors.
- LSTM Layers: Includes two LSTM layers with 64 hidden units each. To prevent overfitting and stabilize training, each LSTM layer is followed by a Dropout layer and BatchNormalization.
- Output Layer: A Dense layer with softmax activation outputs probabilities for the next word.
- Learning-rate callback: A ReduceLROnPlateau callback halves the learning rate (down to a minimum of 1e-6) whenever the training loss fails to improve for one epoch, aiding convergence.

In [20]:
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, Dropout, BatchNormalization
from keras.optimizers import Adam
from keras.callbacks import ReduceLROnPlateau

# Hidden dimensions for LSTM layers
hidden_dim = 64

# Define input layer
inputs = Input(shape=(max_seq_length,), name='input_layer')

# Embedding Layer
embedding = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[embedding_matrix],
    trainable=False,
    name='embedding_layer'
)(inputs)

# First LSTM Block
lstm1 = LSTM(units=hidden_dim, return_sequences=True, name='lstm_1')(embedding)
dropout1 = Dropout(0.1, name='dropout_1')(lstm1)
bn1 = BatchNormalization(name='batch_norm_1')(dropout1)

# Second LSTM Block
lstm2 = LSTM(units=hidden_dim, name='lstm_2')(bn1)
dropout2 = Dropout(0.1, name='dropout_2')(lstm2)
bn2 = BatchNormalization(name='batch_norm_2')(dropout2)

# Output Layer
outputs = Dense(vocab_size, activation='softmax', name='output_layer')(bn2)

# Define the Model
model = Model(inputs=inputs, outputs=outputs, name='functional_rnn_model')

# Compile the Model
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Adam()  # You can specify learning rate or other parameters if needed
)

# Model Summary
model.summary()

# Define the learning rate reduction callback
reduce_lr = ReduceLROnPlateau(
    monitor='loss',
    factor=0.5,
    patience=1,
    verbose=1,
    min_lr=1e-6  # Optional: set a minimum learning rate
)

Model: "functional_rnn_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_layer (InputLayer)     (None, 75)                0         
_________________________________________________________________
embedding_layer (Embedding)  (None, 75, 100)           1613700   
_________________________________________________________________
lstm_1 (LSTM)                (None, 75, 64)            42240     
_________________________________________________________________
dropout_1 (Dropout)          (None, 75, 64)            0         
_________________________________________________________________
batch_norm_1 (BatchNormaliza (None, 75, 64)            256       
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)       
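
The parameter counts shown in the summary can be reproduced by hand with the standard formulas for each layer type. The helper names below are introduced only for this check.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of length embedding_dim per vocabulary entry
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def batch_norm_params(features):
    # gamma, beta, moving mean, and moving variance per feature
    return 4 * features

print(embedding_params(16137, 100))  # embedding_layer
print(lstm_params(100, 64))          # lstm_1
print(batch_norm_params(64))         # batch_norm_1
print(lstm_params(64, 64))           # lstm_2
```

Because the embedding layer is created with trainable=False, its weights are by far the largest block of parameters listed above but are not updated during training.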

## Step 8: Training our RNN Model on Ponniyin Selvan

We are training the model using the fit method, where the training data (X_train_padded_train and y_train_train) is used to adjust the model's weights over 50 epochs, with a batch size of 128. The validation data (X_train_padded_val and y_train_val) is used to monitor the model's performance on unseen data during training. The ReduceLROnPlateau callback reduces the learning rate if the training loss stops improving, helping the model converge more effectively.

In [21]:
# Now retry training the model
history = model.fit(
    X_train_padded_train, y_train_train,
    validation_data=(X_train_padded_val, y_train_val),
    batch_size=128,
    epochs=50,
    callbacks=[reduce_lr]  # Include the learning rate callback
)


Train on 379796 samples, validate on 42200 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
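
To give a feel for how the callback settings (factor=0.5, patience=1, min_lr=1e-6) shape the learning rate, here is a simplified simulation. It is a sketch, not the Keras implementation: it ignores cooldown, assumes an initial learning rate of 0.001 (the Adam default) and a small improvement threshold, and the loss values are invented.

```python
def simulate_reduce_on_plateau(losses, initial_lr=1e-3, factor=0.5,
                               patience=1, min_lr=1e-6, min_delta=1e-4):
    """Cut the learning rate by `factor` after `patience` epochs without an
    improvement of at least `min_delta`, never going below `min_lr`."""
    lr, best, wait = initial_lr, float('inf'), 0
    history = []
    for loss in losses:
        history.append(lr)            # learning rate used during this epoch
        if loss < best - min_delta:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
    return history

# Hypothetical per-epoch training losses
toy_losses = [5.0, 4.5, 4.5, 4.2, 4.2, 4.2]
lr_history = simulate_reduce_on_plateau(toy_losses)
print(lr_history)
```

With patience=1, a single stalled epoch is enough to halve the rate, so over 50 epochs the learning rate can fall quickly toward min_lr.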


## Step 9: Saving the model

In [22]:
# Save the full model in HDF5 format (the pickle-based alternative below is disabled)
# with open('tensorflow_model.pkl', 'wb') as file:
#     pickle.dump(model, file)

model.save('model/rnn_cbow_model.h5')
print('Model Saved!')

Model Saved!


## Step 10: Text Generation

#### Generate Paragraph

The generate_sentence() function uses the trained RNN model to generate sentences. It starts with the sentence_start_token and predicts each subsequent word by sampling from the model's probability distribution for the next word. The generation continues until it reaches the sentence_end_token or a specified sentence length. The predicted word indices are then converted back into actual words, forming a complete sentence. This function enables the model to create coherent text based on the patterns it has learned during training.

In [23]:
def generate_sentence(model, word_to_index, index_to_word, max_seq_length, senten_max_length):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    
    # Repeat until we get an end token or reach the maximum sentence length
    while (new_sentence[-1] != word_to_index[sentence_end_token]) and len(new_sentence) < senten_max_length:
        # Prepare the input sequence
        sequence = new_sentence
        # Pad the sequence
        sequence_padded = pad_sequences([sequence], maxlen=max_seq_length, padding='pre')
        
        # Predict the next word
        predicted_probs = model.predict(sequence_padded, verbose=0)[0]
        # The model returns a single distribution over the vocabulary for the next word
        next_word_probs = predicted_probs
        
        # Sample the next word, avoiding UNKNOWN_TOKEN
        sampled_word_index = word_to_index[unknown_token]
        while sampled_word_index == word_to_index[unknown_token]:
            # Sample from the distribution
            sampled_word_index = np.random.choice(len(next_word_probs), p=next_word_probs)
        
        # Append the sampled word to the sentence
        new_sentence.append(sampled_word_index)
    
    # Convert indices to words, dropping SENTENCE_START and SENTENCE_END if it was reached
    end_index = word_to_index[sentence_end_token]
    sentence_str = [index_to_word[idx] for idx in new_sentence[1:] if idx != end_index]
    return ' '.join(sentence_str)
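
The function above avoids UNKNOWN_TOKEN by resampling until a different index is drawn. An equivalent alternative, sketched below, gives the excluded index zero weight and samples once; both approaches draw from the same conditional distribution. This is not the notebook's method, and `sample_excluding` and the toy probabilities are assumptions made for the example.

```python
import random

def sample_excluding(probs, excluded_index, rng=random):
    """Sample an index from `probs` with `excluded_index` given zero weight.
    random.choices() normalizes the weights itself."""
    weights = [0.0 if i == excluded_index else p for i, p in enumerate(probs)]
    if sum(weights) <= 0:
        raise ValueError("all probability mass is on the excluded index")
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_probs = [0.1, 0.2, 0.3, 0.4]   # pretend index 3 is UNKNOWN_TOKEN
samples = [sample_excluding(toy_probs, excluded_index=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

Sampling once has a fixed cost, whereas the resampling loop can take many iterations when the model puts most of its probability on UNKNOWN_TOKEN.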

The next cell generates 20 sentences. Since we want reasonably long sentences rather than ones with one or two words, each sentence is regenerated until it contains at least senten_min_length (7) words, while senten_max_length (20) caps the number of tokens generate_sentence() may produce, start token included.

In [24]:
num_sentences = 20
senten_min_length = 7
senten_max_length = 20

for _ in range(num_sentences):
    sent = ''
    # We want long sentences, not sentences with one or two words
    while len(sent.split()) < senten_min_length:
        sent = generate_sentence(model, word_to_index, index_to_word, max_seq_length, senten_max_length)
    print(sent)

” “தாங்கள் கவலைப்படவில்லை மலையமான் மகள் வானவன்மாதேவிக்கும் தாதியர்களுக்கும் ஒரு பெரிய மரத்தின் அடியில் இறங்கினார்களோ _ comma _ காடூ சுற்றிச் சுற்றி
_ comma _ பல இள மரங்களின் மாதரசி
நான் அவ்விதம் செய்வதற்கு பெரும் போர் விட்டூ வருகிறார்கள்
படூத்துக் மிகப் பக்கம் சகிக்காமல் ஆர்வம் ஆதித்த கரிகாலரின் உயிருக்கு இப்போது நம்மிடம் உயிர்க்குயிரான நன்கு உணர்ந்து கொண்டேன்
அல்லது பனித்துளி அனைவரும் வீடுகளும் கடைவீதிகளும் சிவாலயக் கற்றளிகளும் திருமாலுக்குரிய விண்ணகரங்களும் குரல்
கம்பீரமான சோழ மன்னரின் இளம் புதல்வர் அருள்மொழிவர்மர் முன்வந்தார் “அப்பா என்று கட்டளையிட்டார் _ comma _ _ comma _ ஆனால் கதை
ஆண்டாளின் பெற்ற பட்டத்து பராந்தக சக்கரவர்த்தி அருமை மூத்த புதல்வரிடமிருந்து கடிதங்களும் நான் என்ன அவசர காரியம் இல்லை
குந்தவையும் ஒற்றன் அதோ நாட்டில் தமிழ்ப் பாடலுக்குப் பசித்திருக்கின்றன
சைவர்களும் வைஷ்ணவர்களும் என்று என்னை _ comma _ கடம்பூர் சம்புவரையர் மகனை கூறினார்
காடூம் உன்னைக் மீண்டும் சோழ குலத்தின் செய்தார்கள் எதற்காகத் இங்கே அழைத்து வரச்
மாட்டீர்கள் அந்தரங்க நினைத்துப் பார்த்தால் _ comma _ பல இள 

#### Generating based on a Prompt

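The generate_sentence_with_start() function generates a sentence from a given starting text. It first converts the starting words into word indices using the model's vocabulary, mapping any word that is not in the vocabulary to UNKNOWN_TOKEN. Then it uses the trained RNN model to sample and append the next word, continuing until the sentence reaches a specified maximum length or the end token is sampled. Unlike generate_sentence(), it does not prepend SENTENCE_START and does not resample when UNKNOWN_TOKEN is drawn.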

In [25]:
def generate_sentence_with_start(model, word_to_index, index_to_word, max_seq_length, start_text, senten_max_length):
    # Convert the start_text to lowercase to match your vocabulary
    start_text = start_text.lower()
    
    # Convert the start_text into indices based on your vocabulary
    new_sentence = [word_to_index.get(word, word_to_index[unknown_token]) for word in start_text.split()]
    
    # Repeat until we get an end token or reach the max sentence length
    while len(new_sentence) < senten_max_length:
        # Prepare the input sequence
        sequence = new_sentence
        # Pad the sequence
        sequence_padded = pad_sequences([sequence], maxlen=max_seq_length, padding='pre')
        
        # Predict the next word
        predicted_probs = model.predict(sequence_padded, verbose=0)[0]
        # The model returns a single distribution over the vocabulary for the next word
        next_word_probs = predicted_probs
        
        # Sample the next word
        sampled_word_index = np.random.choice(len(next_word_probs), p=next_word_probs)
        
        # Check if the sampled word is the SENTENCE_END token
        if sampled_word_index == word_to_index.get(sentence_end_token):
            break  # Stop adding words if we reach the end token
        
        # Append the predicted word
        new_sentence.append(sampled_word_index)
    
    # Convert indices back to words
    sentence_str = [index_to_word[idx] for idx in new_sentence]
    generated_text = ' '.join(sentence_str)
    generated_text += '.'
    return generated_text

#### <b> Prompts and their Outputs </b>

In [26]:
start_prompt = "பொதுவாக"
generated_output = generate_sentence_with_start(model, word_to_index, index_to_word, max_seq_length, start_prompt, senten_max_length=50)

print("Generated text:", generated_output)

Generated text: பொதுவாக மலரின் இனிய ஜங்கார சுந்தர சோழரும் சுந்தர சோழ சக்கரவர்த்தி தர்ம ராஜ்யம் நடப்பதாகச் பார்த்துக் கொண்டிருந்த பெரும் பெரிய நதியில் நதி வரையிலும் வரையிலும் பரந்திருந்த தேசங்களிலிருந்து தலைநகருக்குப் பலர் திரும்பிப் பார்த்து அவளை என் காதில் விழுந்து அவருக்குச் முழுவதும் காது கொடுத்துக் கவனமாகக் கேட்டபோது என் தந்தை பார்த்து இன்று வரை வந்து விட்டதோ என்று இருக்கின்றன அல்லவா நீ சொன்னபடி வேலை செய்து.


In [27]:
start_prompt = "வணக்கம்"
generated_output = generate_sentence_with_start(model, word_to_index, index_to_word, max_seq_length, start_prompt, senten_max_length=50)

print("Generated text:", generated_output)

Generated text: வணக்கம் செய்துவிட்டுப் பாடவும் ஆடவும் தொடங்கினார்கள் என்று பல குரல்களில் முன்னால் கனவு செய்ய வேண்டும் என்று நண்பர்கள் இருவரும் அங்கிருந்து போய் விட்டுப் போவது என்று எனக்கே தெரியவில்லை கூடியவர்கள் சொல்லுங்கள் பார்த்து சொல்லி விட்டூ விட்டூ என்னைக் கொண்டுதான் தெரிந்து கொண்டூ போக முடியாது என்பதை பழுவூர் இளைய ராணியின் கோட்டைக் அழைத்து வரச் அவரைப் பார்த்து விட்டு திரும்பி வருவான் என்று சொல்ல சொல்ல வேண்டும்.


In [28]:
start_prompt = "சோழன்"
generated_output = generate_sentence_with_start(model, word_to_index, index_to_word, max_seq_length, start_prompt, senten_max_length=50)

print("Generated text:", generated_output)

Generated text: சோழன் அவளுடைய உதவியை அடியோடு விட்டூ விட்டு விட்டூ அனைவரும் நிம்மதி பல வர்ண இறகுகள் படைத்த பட்டுப் பூச்சி ஒருவன் வந்து மீண்டும் வழங்கும் இப்பாடல்களில் கடைசிப் சைவர் என்று பெயர் வந்தியத்தேவன் கூறியதும் கருணை சந்நியாசி அருகில் வந்து நின்று வண்ணம் பார்த்துக் கொண்டு விரைந்து கரையேறத் செய்து கொண்டூ புன்னகை வந்து புன்னகை புரிந்து கொண்டிருப்பது போல் இந்தச் சிங்கள மன்னர்களுக்குப் புத்தி கற்பிக்க வேண்டும்.


In [29]:
model.save_weights('model/rnn_cbow_model_weight.h5')
print('Model Saved!')

Model Saved!
