# 1. Ponniyin Selvan Word Embeddings using Continuous Bag of Words (CBoW)

## Introduction
In this notebook, we will train word embeddings using the Continuous Bag of Words (CBoW) model on a Tamil text corpus from the classic work *Ponniyin Selvan* by Kalki Krishnamurthy. Word embeddings are crucial in natural language processing (NLP) because they allow us to represent words as dense vectors that capture semantic relationships between them. 

**Objective**: We aim to preprocess the Tamil text, tokenize it, and build a CBoW model using Keras to learn meaningful embeddings. These embeddings can later be used for various NLP tasks, such as text classification, sentiment analysis, or even as input features for more complex deep learning models.

**Why Use CBoW?**: The CBoW model predicts a target word from its surrounding context words. It is simple, captures semantic relationships well, and typically trains faster than Skip-gram because it makes one prediction per context window rather than one per context word.
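
To make the prediction step concrete, here is a minimal pure-Python sketch of a CBoW forward pass: look up the context embeddings, average them, and apply a softmax over the vocabulary. The function name `cbow_forward` and the toy weights are illustrative assumptions, not part of the model trained in this notebook.

```python
import math

def cbow_forward(context_ids, embeddings, output_weights, output_bias):
    """Return one probability per vocabulary word, given context word ids."""
    dim = len(embeddings[0])
    vocab_size = len(output_bias)
    # 1. Average the embedding vectors of the context words.
    hidden = [
        sum(embeddings[i][d] for i in context_ids) / len(context_ids)
        for d in range(dim)
    ]
    # 2. Project the averaged vector onto one score (logit) per vocabulary word.
    logits = [
        sum(hidden[d] * output_weights[d][v] for d in range(dim)) + output_bias[v]
        for v in range(vocab_size)
    ]
    # 3. Softmax, shifted by the largest logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup (hypothetical values): 4-word vocabulary, 2-dimensional embeddings.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_output_weights = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]  # dim x vocab
toy_output_bias = [0.0, 0.0, 0.0, 0.0]

probs = cbow_forward([1, 2], toy_embeddings, toy_output_weights, toy_output_bias)
print(probs)
```

Training adjusts the embedding and output weights so that the probability assigned to the true target word increases.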

## Step 1: Import Libraries

We start by importing the necessary libraries. These include `re` for regular expression operations, `numpy` for numerical computations, and TensorFlow and Keras for building our neural network model. We also import functions from `indic-nlp-library` to handle tokenization specific to Tamil.

In [1]:
import re
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Lambda, Reshape
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from indicnlp.tokenize.sentence_tokenize import sentence_split
from indicnlp.tokenize import indic_tokenize

## Step 2: Define Text Cleaning Functions

Next, we define functions to clean the text. The `clean_numbers` function removes digits and hyphens, and the `clean_roman_numerals` function removes Roman numerals (such as chapter numbers) together with a trailing period.

In [2]:
def clean_numbers(text):
    pattern = r"[\d-]"
    return re.sub(pattern, '', text)

def clean_roman_numerals(text):
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '', text)
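
As a quick check of what these two functions do, the sketch below applies them to a made-up sample line. The function definitions are repeated only so the snippet runs on its own; the sample string is an assumption for illustration and is not taken from the corpus.

```python
import re

def clean_numbers(text):
    # Remove every digit and hyphen.
    return re.sub(r"[\d-]", '', text)

def clean_roman_numerals(text):
    # Remove standalone Roman numerals, plus an optional trailing period.
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '', text)

sample = "XIV. அத்தியாயம் 12-34 ஆடி"  # made-up sample line
without_numbers = clean_numbers(sample)
cleaned = clean_roman_numerals(without_numbers)
print(repr(cleaned))
```

Both functions leave the surrounding whitespace in place, which is why Step 3 collapses repeated spaces afterwards.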

## Step 3: Load and Preprocess the Text

We load the text from the file `ponniyin-selvan.txt` and perform several preprocessing steps:
- Replace commas, colons, and semicolons with placeholder tokens (`_comma_`, `_eol_`), and normalize sentence-ending punctuation to periods.
- Remove unnecessary characters such as quotes, tabs, and extra spaces.
- Remove numbers and Roman numerals.

Finally, we split the text into sentences and tokenize them using the `indic-nlp-library`.

In [3]:
print("Reading txt file...")
with open(r'ponniyin-selvan.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Preprocessing: Replacing punctuation and cleaning
text = text.replace(",\n", " _eol_ ")
text = text.replace(",", " _comma_  ")
text = text.replace(":", " _comma_  ")
text = text.replace(";", " _comma_  ")
text = text.replace("?\n", ". ")
text = text.replace("!\n", ". ")
text = text.replace(".\n", ". ")
text = text.replace('"', "")  # Remove double quotes
text = text.replace("'", "")  # Remove single quotes
text = text.replace("?", ".")
text = text.replace("!", ".")
text = text.replace("\t", "")
text = text.replace("\u200c", "")  # Remove zero-width non-joiners

# Additional cleaning: Remove numbers and combine spaces
text = clean_numbers(text)
text = clean_roman_numerals(text)
text = re.sub(r"\s+", " ", text).strip()

# Sentence splitting using indic-nlp-library for Tamil
sentences = sentence_split(text, lang='ta')  # 'ta' is the Tamil language code

# Lowercase, drop sentences with two or fewer words, and tokenize
sentences = [s.lower().strip() for s in sentences if len(s.split()) > 2]
tokenized_sentences = [indic_tokenize.trivial_tokenize(s, lang='ta') for s in sentences]

print("Preprocessing done!")

Reading txt file...
Preprocessing done!
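
The punctuation rules above can be tried on a short string without loading the whole book. The helper below is a hypothetical condensation of those replacements (the name `normalize_punctuation` and the sample text are assumptions); it leaves out sentence splitting and tokenization, which need `indic-nlp-library`.

```python
import re

def normalize_punctuation(text):
    """Hypothetical helper that mirrors the replacement rules used in this step."""
    text = text.replace(",\n", " _eol_ ")      # comma at end of line
    for mark in (",", ":", ";"):
        text = text.replace(mark, " _comma_ ")  # clause separators become a token
    for mark in ("?", "!"):
        text = text.replace(mark, ".")          # every sentence ends with a period
    text = text.replace(".\n", ". ")
    text = text.replace('"', "").replace("'", "")
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

sample = 'அவன் வந்தான், பார்த்தான்!\n"யார் அது?"'  # made-up sample
print(normalize_punctuation(sample))
```

Keeping commas as a `_comma_` token, rather than deleting them, lets the model treat clause boundaries as ordinary context words.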


In [4]:
sentences[0:10]

['பொன்னியின் செல்வன் வரலாற்றுப் புதினம் அமரர் கல்கி கிருஷ்ணமூர்த்தி அத்தியாயம் ஆடித்திருநாள் ஆதி அந்தமில்லாத கால வெள்ளத்தில் கற்பனை ஓடத்தில் ஏறி நம்முடன் சிறிது நேரம் பிரயாணம் செய்யுமாறு நேயர்களை அழைக்கிறோம்.',
 'விநாடிக்கு ஒரு நூற்றாண்டூ வீதம் எளிதில் கடந்து இன்றைக்குத் தொள்ளாயிரத்து எண்பத்திரண்டூ (ல் எழுதியது) ஆண்டூகளுக்கு முந்திய காலத்துக்குச் செல்வோமாக.',
 'தொண்டை நாட்டுக்கும் சோழ நாட்டுக்கும் இடையில் உள்ள திருமுனைப்பாடி நாட்டின் தென்பகுதியில் _comma_ தில்லைச் சிற்றம்பலத்துக்கு மேற்கே இரண்டூ காததூரத்தில் _comma_ அலை கடல் போன்ற ஓர் ஏரி விரிந்து பரந்து கிடக்கிறது.',
 'அதற்கு வீரநாராயண ஏரி என்று பெயர்.',
 'அது தெற்கு வடக்கில் ஒன்றரைக் காத நீளமும் கிழக்கு மேற்கில் அரைக் காத அகலமும் உள்ளது.',
 'காலப்போக்கில் அதன் பெயர் சிதைந்து இந்நாளில் வீராணத்து ஏரி என்ற பெயரால் வழங்கி வருகிறது.',
 'புது வெள்ளம் வந்து பாய்ந்து ஏரியில் நீர் நிரம்பித் ததும்பி நிற்கும் ஆடி ஆவணி மாதங்களில் வீரநாராயண ஏரியைப் பார்ப்பவர் எவரும் நம்முடைய பழந்தமிழ் நாட்டு முன்னோர்கள் தங்கள் காலத்தில் சாதித்த அரும்பெரும் காரியங

## Step 4: Initialize Model Parameters

We define the embedding dimension (`dim`) and the context window size (`window_size`), and calculate the vocabulary size (`V`). The vocabulary size is the number of unique words in the corpus plus one, because index 0 is reserved for padding.

In [5]:
dim = 100  # Embedding dimensions
window_size = 2  # Context window size
V = len(set(word for sentence in tokenized_sentences for word in sentence)) + 1

## Step 5: Tokenize and Convert Words to Sequences

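We use the Keras `Tokenizer` to map each word to an integer index and convert the tokenized sentences into integer sequences, which is the input form the Embedding layer expects. `V` is then recomputed from the tokenizer's word index, again adding one for the padding index 0.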

In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokenized_sentences)
corpus = tokenizer.texts_to_sequences(tokenized_sentences)
V = len(tokenizer.word_index) + 1
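
For intuition, the following pure-Python sketch approximates what the tokenizer does with pre-tokenized input: it ranks words by frequency, assigns indices starting at 1, and leaves 0 free for padding. The helper names and the toy sentences are assumptions for illustration; this is not the Keras implementation.

```python
from collections import Counter

def build_word_index(tokenized_sentences):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(tokenized_sentences, word_index):
    """Replace each known word with its integer index."""
    return [[word_index[t] for t in sent if t in word_index]
            for sent in tokenized_sentences]

toy_sentences = [["சோழ", "நாடு", "வளம்"], ["சோழ", "மன்னன்"]]  # made-up tokens
word_index = build_word_index(toy_sentences)
sequences = to_sequences(toy_sentences, word_index)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding

print(word_index)
print(sequences)
print(vocab_size)
```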

In [7]:
# Check the total number of sentences in the corpus
total_sentences = len(corpus)
print(f"Total number of sentences: {total_sentences}")

Total number of sentences: 52041


In [8]:
# Select a smaller subset of the corpus for testing
num_sentences_to_use = 6000
corpus_subset = corpus[:num_sentences_to_use]

## Step 6: Generate Training Data for CBoW

We create two functions to generate training data for the CBoW model. The `generate_data` function is a generator that yields one padded context and its one-hot target at a time, while the `generate_all_data_cbow` function builds the full input and target arrays in memory at once.

In [9]:
def generate_data(corpus, window_size, V):
    maxlen = window_size * 2
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            contexts = []
            labels = []
            s = index - window_size
            e = index + window_size + 1
            
            contexts.append([words[i] for i in range(s, e) if 0 <= i < L and i != index])
            labels.append(word)
            
            x = pad_sequences(contexts, maxlen=maxlen)
            y = to_categorical(labels, V)
            yield (x, y)

def generate_all_data_cbow(corpus, window_size, V):
    all_in = []
    all_out = []
    for sentence in corpus:
        L = len(sentence)
        for index, word in enumerate(sentence):
            start = index - window_size
            end = index + window_size + 1
            context_words = [sentence[i] if 0 <= i < L and i != index else 0 for i in range(start, end)]
            all_in.append(context_words)
            all_out.append(to_categorical(word, V))
    return np.array(all_in), np.array(all_out)
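
To see what a context window looks like, the sketch below lists the (context, target) pairs for one short sentence of word indices. It is a simplified variant, not the notebook's function: unlike `generate_all_data_cbow`, which keeps a 0 at the target position, this hypothetical `cbow_pairs` helper skips the target index, so each context has exactly `2 * window_size` entries.

```python
def cbow_pairs(sentence, window_size):
    """Return (context, target) pairs; positions outside the sentence become 0."""
    pairs = []
    for index, target in enumerate(sentence):
        context = [
            sentence[i] if 0 <= i < len(sentence) else 0  # 0 is the padding index
            for i in range(index - window_size, index + window_size + 1)
            if i != index                                  # leave out the target itself
        ]
        pairs.append((context, target))
    return pairs

toy_sentence = [5, 8, 3, 9]  # made-up word indices
for context, target in cbow_pairs(toy_sentence, window_size=2):
    print(context, "->", target)
```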

## Step 7: Create Training Data

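We use the `generate_all_data_cbow` function to generate all the training data at once. `X_cbow` holds the context word indices and `y_cbow` holds the one-hot encoded target words. Because the function writes a 0 at the target position instead of skipping it, each row of `X_cbow` has `2 * window_size + 1 = 5` entries, one more than the `input_length` of `window_size * 2 = 4` declared for the Embedding layer in Step 8. Skipping the target index in the list comprehension would make the two agree.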

In [10]:
X_cbow, y_cbow = generate_all_data_cbow(corpus_subset, window_size, V)
print(f"Data shapes - X: {X_cbow.shape}, y: {y_cbow.shape}")

Data shapes - X: (63478, 5), y: (63478, 69351)
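
The shape of `y_cbow` has a practical cost worth estimating. The arithmetic below is a rough sketch that assumes `to_categorical` produces 4-byte float32 values (an assumption about the default dtype) and compares the dense one-hot array with plain integer labels. Keras also offers the `sparse_categorical_crossentropy` loss, which accepts integer class ids as targets; using it would be a change to this notebook, not something the notebook does.

```python
rows, vocab_size = 63478, 69351       # shape of y_cbow reported above

bytes_per_float = 4                   # assumes float32 one-hot values
dense_bytes = rows * vocab_size * bytes_per_float

bytes_per_label = 4                   # assumes int32 class ids
sparse_bytes = rows * bytes_per_label

print(f"one-hot targets: {dense_bytes / 1e9:.1f} GB")
print(f"integer targets: {sparse_bytes / 1e6:.2f} MB")
```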


## Step 8: Build the CBoW Model

We build a simple CBoW model using Keras. The model consists of an Embedding layer, a Lambda layer to average the context word embeddings, and a Dense output layer with a softmax activation.

In [11]:
cbow_model = Sequential()
cbow_model.add(Embedding(input_dim=V, output_dim=dim, input_length=window_size*2, embeddings_initializer='glorot_uniform'))
cbow_model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim,)))
cbow_model.add(Dense(V, activation='softmax', kernel_initializer='glorot_uniform'))

cbow_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
cbow_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 4, 100)            6935100   
_________________________________________________________________
lambda (Lambda)              (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 69351)             7004451   
Total params: 13,939,551
Trainable params: 13,939,551
Non-trainable params: 0
_________________________________________________________________
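
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on `V` and `dim`. This short calculation uses only the values shown above; the variable names are illustrative.

```python
vocab_size, dim = 69351, 100

# Embedding layer: one dim-sized vector per vocabulary index.
embedding_params = vocab_size * dim

# Dense layer: a dim x vocab weight matrix plus one bias per vocabulary word.
dense_params = dim * vocab_size + vocab_size

total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

The Lambda layer only averages its input, so it contributes no parameters.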


## Step 9: Train the Model

We train the model using the `fit` method. The model is trained for 75 epochs with a batch size of 64, minimizing the categorical cross-entropy loss.

In [12]:
cbow_model.fit(X_cbow, y_cbow, batch_size=64, epochs=75, verbose=1)

Epoch 1/75
...
Epoch 75/75


<tensorflow.python.keras.callbacks.History at 0x2670ccb8940>

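## Step 10: Save the Embeddings

We write the learned embedding matrix to a text file in the word2vec format: a header line with the number of words and the vector dimension, followed by one line per word.

In [13]: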
V_cbow = len(set(word for sentence in corpus_subset for word in sentence)) + 1

In [14]:
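# Open the file in write mode with utf-8 encoding
with open('my_cbow_vectors_ps.txt', 'w', encoding='utf-8') as f:
    # Write the header: number of words and the dimension of the vectors.
    # Note: this count covers only the words in corpus_subset, while the loop
    # below writes every word in tokenizer.word_index, so the header count is
    # smaller than the number of rows written.
    f.write('{} {}\n'.format(V_cbow - 1, dim))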

    # Retrieve the word vectors from the model
    vectors = cbow_model.get_weights()[0]

    # Loop through the word index from the tokenizer
    for word, i in tokenizer.word_index.items():
        # Convert the vector to a string
        str_vec = ' '.join(map(str, list(vectors[i, :])))
        # Write the word and its vector to the file
        f.write('{} {}\n'.format(word, str_vec))
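
The word2vec text layout is easy to inspect with a small round trip. The sketch below writes two made-up vectors to an in-memory buffer and reads them back; `write_word2vec_text` and `read_word2vec_text` are hypothetical helpers written for this illustration, not gensim functions.

```python
import io

def write_word2vec_text(handle, vectors_by_word):
    """Write a header 'count dim', then one 'word v1 v2 ...' line per word."""
    dim = len(next(iter(vectors_by_word.values())))
    handle.write(f"{len(vectors_by_word)} {dim}\n")
    for word, vec in vectors_by_word.items():
        handle.write(word + " " + " ".join(str(x) for x in vec) + "\n")

def read_word2vec_text(handle):
    """Read exactly as many rows as the header declares."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for _ in range(count):
        parts = handle.readline().rstrip("\n").split(" ")
        vec = [float(x) for x in parts[1:]]
        if len(vec) != dim:
            raise ValueError(f"expected {dim} values for {parts[0]!r}")
        vectors[parts[0]] = vec
    return vectors

toy_vectors = {"சோழன்": [0.1, 0.2, 0.3], "ஏரி": [0.4, 0.5, 0.6]}  # made-up values
buffer = io.StringIO()
write_word2vec_text(buffer, toy_vectors)
buffer.seek(0)
loaded = read_word2vec_text(buffer)
print(loaded)
```

A reader of this kind trusts the header, so the declared count should equal the number of rows actually written.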

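## Step 11: Load and Query the Embeddings

We load the saved vectors with gensim and look up the words closest to a query word by cosine similarity.

In [15]: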
import gensim
w2v_cbow = gensim.models.KeyedVectors.load_word2vec_format('./my_cbow_vectors_ps.txt', binary=False)

In [16]:
w2v_cbow.most_similar(positive=['சோழன்'])

[('சோழனும்', 0.6407018899917603),
 ('நிகர்த்த', 0.4783225953578949),
 ('கரிகாலர்', 0.4767846465110779),
 ('கரிகாலன்', 0.4744872450828552),
 ('கரிகாலனின்', 0.4527541399002075),
 ('உரியவர்கள்', 0.4411161541938782),
 ('நடத்திய', 0.43299442529678345),
 ('கரிகாலனை', 0.4311104416847229),
 ('தஞ்சைச்', 0.43104785680770874),
 ('நிறைவேற்றுவதற்கு', 0.42015257477760315)]
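
The scores returned by `most_similar` are cosine similarities between the query vector and the other word vectors. The sketch below shows the computation on made-up 2-dimensional vectors; `cosine_similarity` and `nearest_words` are hypothetical helpers for illustration, not the gensim implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_words(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

toy_vectors = {                 # made-up embeddings
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],            # nearly the same direction as "a"
    "c": [0.0, 1.0],            # orthogonal to "a"
    "d": [-1.0, 0.0],           # opposite to "a"
}
for word, score in nearest_words("a", toy_vectors):
    print(word, round(score, 3))
```

Because cosine similarity ignores vector length, words rank highly when their embeddings point in a similar direction, regardless of magnitude.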