<a href="https://colab.research.google.com/github/joohoho/JaskierBulgarianPoet/blob/main/Jaskier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd

In [2]:
import string
import re
import os

from gensim.models import Word2Vec, KeyedVectors

import tensorflow as tf

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Embedding, Dense, LSTM
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import Callback, ModelCheckpoint

# Jaskier The Bulgarian Poet

### Course project: a conversational chatbot in Bulgarian that preserves text semantics with Word2Vec and a Recurrent Neural Network (RNN)

I decided to create a chatbot in the Bulgarian language that converses in rhymes. For this reason, I need to train (and test) a neural network with poems in the Bulgarian language.

The Bulgarian poems dataset is taken from [Kaggle](https://www.kaggle.com/datasets/auhide/bulgarian-poems). Let's have a look at it.

In [3]:
poems_table = pd.read_csv("drive/MyDrive/data/chitanka-corpus.csv")

In [4]:
poems_table.head(5)

Unnamed: 0,author,title,poem
0,Андреас Неуман,(Гравитацията и грацията),Два съда с вода.<eol>Леко пързаляне с кънки.<e...
1,Андреас Неуман,(Клаудия в библиотеката),Издирваш в книгите<eol>със странно усърдие на ...
2,Андреас Неуман,(Лолита джаз),На Хусто Наваро и Хуан де Локса<eol>І<eol>Преч...
3,Александър Радонов,(Моята) умираща мечта,"Като рак ядяща,<eol>самотата подяжда<eol>нечия..."
4,Хорхе Урутия,(Не поверителни) данни за една справка,че не пиша ръкописвам начертавам зачертавам по...


### Preprocessing

I decided to design a word-level model, so each word in the dataset is a token. We need to preprocess all texts (poems) and clean them of punctuation marks, non-textual symbols (e.g. numbers), and stop words (words that repeat a lot and carry little meaning). Let's start with the preprocessing of the texts.

In [5]:
# Convert all poems to lowercase

poems_table["poem"] = poems_table.poem.str.lower()

In [6]:
poems_list = poems_table.poem.values.tolist()

In [7]:
poems_list[0]

'два съда с вода.<eol>леко пързаляне с кънки.<eol> <eol>разсъблечени, закръглени,<eol>започваш и завършваш<eol>на краката си:<eol> <eol>съумял си да ги обучиш<eol>в малкото изкуство<eol>да изследват за тебе твърдата земя.<eol> <eol>ще оближа търпеливо<eol>тези точни пръсти.<eol> <eol>(ако беше смислен апетитът ми,<eol>бих закусил с тях,<eol>бели карамели.)<eol> <eol>оставям си тук ботушите.<eol>битките спечелват<eol>страхливите войници.<eol> <eol>гледайки ги, голи,<eol>се уча да ходя.<eol> <eol>събуй ми тези часове,<eol> направи ми път.'

We need a collection of Bulgarian stop words for the cleaning process. The stop words are taken from [Stopwords ISO](https://github.com/stopwords-iso/stopwords-bg).

In [8]:
with open("drive/MyDrive/data/stopwords-bg.txt", encoding="utf8") as f:
    text = f.read()

In [9]:
stopwords = re.split(r"\W+", text)  # raw string, so \W is read as a regex class and not as a string escape

Let's set some parameters that we will need further on.

In [10]:
# With the default index, max(index) equals len - 1, so range(num_texts) leaves out the last poem
num_texts = max(poems_table.poem.index)
min_occurance = 20
vector_size = 20
seed = 11

Let's write a function that gathers all the preprocessed words together. Then we will build a set of unique words from it.

In [11]:
# Create a corpus of preprocessed words (repetitions are kept)

def make_corpus(data):
  """
  Gather all the preprocessed words from the poems into one flat list.

  Parameters:
  -----------
  data - the lowercased poems. Should be a list of strings.

  Returns:
  --------
  A list with all the words of interest from the texts, in order of appearance.
  """
  all_poems = []  # local list, so repeated calls do not accumulate words

  for i in range(num_texts):

    text = data[i].split()
    text = [word for word in text if word not in string.punctuation] # Removing punctuation tokens
    text = [word for word in text if word.isalpha()] # Keeping only tokens made of letters
    text = [word for word in text if word not in stopwords] # Removing all the stop words

    all_poems.extend(text)  # Putting all the words together

  return all_poems
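
One side effect of this filter is worth illustrating. The text is split on whitespace only, so a word glued to punctuation or to an end-of-line marker (such as "вода.<eol>леко") fails the isalpha() test and is dropped entirely. The sketch below compares the two behaviours on a fragment of the first poem. The regex-based tokenizer and both helper names are illustrative assumptions, not part of the original pipeline.

```python
import re

sample = "два съда с вода.<eol>леко пързаляне с кънки.<eol> <eol>разсъблечени, закръглени,"

def whitespace_tokens(text):
    # Same idea as make_corpus: split on whitespace, keep purely alphabetic tokens
    return [word for word in text.split() if word.isalpha()]

def regex_tokens(text):
    # Hypothetical alternative: turn the "<eol>" marker into a space,
    # then extract every run of letters (Unicode-aware, so Cyrillic works)
    text = text.replace("<eol>", " ")
    return re.findall(r"[^\W\d_]+", text)

print(whitespace_tokens(sample))
print(regex_tokens(sample))
```

The first call keeps 5 tokens, the second keeps 10, including "вода" and "леко". Stop-word removal would still have to be applied afterwards.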

In [60]:
# These are all the preprocessed words, with repetitions

corpus = make_corpus(poems_list)
corpus[0:100]

['съда',
 'пързаляне',
 'краката',
 'малкото',
 'изследват',
 'тебе',
 'твърдата',
 'оближа',
 'точни',
 'смислен',
 'апетитът',
 'закусил',
 'уча',
 'издирваш',
 'странно',
 'усърдие',
 'профил',
 'начин',
 'свиваш',
 'напрягаш',
 'бавна',
 'татуировка',
 'кръста',
 'отколкото',
 'обрисуваш',
 'осмелява',
 'свързват',
 'думата',
 'нейния',
 'преструвам',
 'обърнеш',
 'поглед',
 'твърде',
 'правиш',
 'хусто',
 'наваро',
 'хуан',
 'де',
 'капризно',
 'изобилие',
 'коляното',
 'точното',
 'отражение',
 'ръкавица',
 'непромокаемия',
 'шлифер',
 'уста',
 'изисква',
 'своя',
 'тъмната',
 'страна',
 'неувереност',
 'леката',
 'страст',
 'започнеш',
 'обичаш',
 'my',
 'мента',
 'моят',
 'орех',
 'езика',
 'става',
 'бръмбарче',
 'моя',
 'връхчето',
 'my',
 'запалиш',
 'невидимо',
 'обсипано',
 'лъже',
 'пясъчният',
 'часовник',
 'отидеш',
 'мене',
 'твоя',
 'неначенати',
 'тестета',
 'порокът',
 'ягодови',
 'асото',
 'спечели',
 'приличаше',
 'ухание',
 'my',
 'charming',
 'little',
 'дъжд',


In [13]:
type(corpus), len(corpus)

(list, 572629)

In [14]:
# This is a unique set of words - our dictionary

set_corpus = set(corpus)

In [15]:
len(set_corpus)

81153

We need a vocabulary that maps each unique word to an index, and also an inverse vocabulary that maps each index back to its word.

In [16]:
# Create a vocabulary

vocab, index = {}, 1  # start indexing from 1
vocab["<UNK>"] = 0  # add an unknown token
for token in set_corpus:
  if token not in vocab:
    vocab[token] = index
    index += 1

In [17]:
len(vocab)

81154

In [18]:
vocab["дръзнах"]

57964

In [19]:
# Create an inverse vocabulary

inverse_vocab = {index: token for token, index in vocab.items()}

In [20]:
inverse_vocab[vocab["дръзнах"]]

'дръзнах'
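
The iteration order of a Python set of strings is not guaranteed to be the same from one interpreter run to the next, so the index assigned to a given word can change between runs. A minimal sketch of a reproducible variant, assuming it is acceptable to sort the unique words first. The toy word list and the name build_vocab are hypothetical.

```python
def build_vocab(tokens):
    # Index 0 is reserved for unknown words; real words start from 1
    vocab = {"<UNK>": 0}
    for index, token in enumerate(sorted(set(tokens)), start=1):
        vocab[token] = index
    # The inverse mapping goes from index back to word
    inverse_vocab = {index: token for token, index in vocab.items()}
    return vocab, inverse_vocab

toy_corpus = ["съда", "пързаляне", "краката", "съда"]
toy_vocab, toy_inverse_vocab = build_vocab(toy_corpus)
print(toy_vocab)
print(toy_inverse_vocab[toy_vocab["съда"]])
```

Sorting costs a little time once, but the same corpus then always yields the same indices, which matters when saved models and saved index sequences are reloaded later.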

We preprocess the poems again, keeping only the words that are present in our dictionary.

In [21]:
def make_corpus_sequences(data):
  """
  Clean a text and keep only the words of interest, as a sequence.

  Parameters:
  -----------
  data - the text to be cleaned up.

  Returns:
  --------
  A list with the words of the text that match the preprocessing rules.
  """
  text = data.split()
  seq = [word for word in text if word in set_corpus] # Keeping only words that are in "set_corpus"

  return seq

In [22]:
corpus_sequences = []
for i in range(num_texts):
    clean_text = make_corpus_sequences(poems_table.poem[i]) # cleaning up the poem
    corpus_sequences.append(clean_text) # adding the preprocessed poem to our list

The next step is to encode each word numerically. We need to represent the tokens (words) as vectors of numbers. There are many ways to do it. We could simply assign a unique integer to each word and turn it into a one-hot encoded vector of fixed length. However, this does not preserve the place of the word in our latent space: such vectors carry no information about the semantics of our texts.

That's why I decided to use the word2vec algorithm.
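
The difference can be made concrete with cosine similarity. In the sketch below, the dense vectors are made-up numbers chosen for illustration, not values from the trained model, and the helper name cosine is an assumption.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors for a toy vocabulary of three words
one_hot = np.eye(3)
print(cosine(one_hot[0], one_hot[1]))  # distinct one-hot vectors are orthogonal

# Made-up dense vectors (not taken from the trained model)
dense = {
    "лунен": np.array([0.9, 0.1, 0.3]),
    "вечерен": np.array([0.8, 0.2, 0.4]),
    "орех": np.array([-0.5, 0.9, -0.2]),
}
print(cosine(dense["лунен"], dense["вечерен"]))
print(cosine(dense["лунен"], dense["орех"]))
```

Any two different one-hot vectors have similarity 0, so no word is "closer" to another. Dense vectors can place related words near each other, which is what word2vec is trained to do.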

In [23]:
word2vec = Word2Vec(corpus_sequences, min_count = min_occurance, vector_size = vector_size, workers = 3, seed = seed)

In [24]:
print(word2vec)

Word2Vec<vocab=4823, vector_size=20, alpha=0.025>


In [25]:
vocab_size = len(word2vec.wv)

I reduce the size of the vocabulary with the hyperparameter "min_count", set to 20. It excludes all words that occur fewer than 20 times in the corpus, which shrinks the vocabulary from the initial 81153 words to 4823.
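
The effect of such a frequency threshold can be sketched without gensim. The function name prune_vocabulary and the toy sentences are hypothetical, and this only imitates the counting step, not gensim's internals.

```python
from collections import Counter

def prune_vocabulary(sentences, min_count):
    # Count every token across all sentences
    counts = Counter(token for sentence in sentences for token in sentence)
    # Keep only the tokens that occur at least min_count times
    return {token for token, count in counts.items() if count >= min_count}

toy_sentences = [
    ["очите", "лунен", "зрак"],
    ["очите", "вечерен", "лунен"],
    ["очите", "унес"],
]
kept = prune_vocabulary(toy_sentences, min_count=2)
print(sorted(kept))
```

With a threshold of 2, only "очите" (3 occurrences) and "лунен" (2 occurrences) survive; the words seen once are discarded.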

In [26]:
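# The Word2Vec constructor above already trained on the corpus; this call continues training for 10 more epochs
word2vec.train(corpus_sequences, total_examples = num_texts, epochs = 10)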



(3565477, 5726290)

Let's save the model to a file with a ".bin" extension (gensim writes its own native format).

In [27]:
word2vec.save("/content/drive/MyDrive/preprocessed_data_Jaskier/word2vec.bin")

Load the model and check similar and dissimilar words.

In [28]:
word2vec = Word2Vec.load("/content/drive/MyDrive/preprocessed_data_Jaskier/word2vec.bin")

In [29]:
similar_words = word2vec.wv.most_similar("лазур")
similar_words

[('сребърна', 0.9492636919021606),
 ('вечерен', 0.9452553987503052),
 ('есенен', 0.9373318552970886),
 ('лунен', 0.9320693016052246),
 ('зрак', 0.929793655872345),
 ('вечерна', 0.9293441772460938),
 ('дъхът', 0.9286230206489563),
 ('лунните', 0.9264757037162781),
 ('унес', 0.9253934621810913),
 ('просторите', 0.9234907031059265)]

In [30]:
dissimilar_word = word2vec.wv.doesnt_match("чуваш първите стъпки поглед".split())
dissimilar_word

'поглед'

In [31]:
vector_sample = word2vec.wv["мирис"]
vector_sample

array([ 0.84118867, -1.0650666 , -0.40173572, -0.57912475, -0.6908556 ,
       -0.00289592,  0.9373853 ,  0.1081099 ,  1.1116147 , -0.9885714 ,
       -0.08186585, -0.549227  ,  0.7389428 , -0.16801046,  0.85311544,
       -0.5582539 ,  0.08983558, -0.1169704 ,  0.12766753,  0.16724469],
      dtype=float32)

In [32]:
# Let's save and reload the vectors

word2vec.wv.save("/content/drive/MyDrive/preprocessed_data_Jaskier/vectors.kv")
word_vectors = KeyedVectors.load("/content/drive/MyDrive/preprocessed_data_Jaskier/vectors.kv")

In [33]:
word_vectors[11]

array([ 1.2204913 , -2.6143334 ,  0.72949004, -0.42397836, -0.11209559,
        0.14924559, -0.1384469 , -0.01623715,  0.2194805 , -0.43451595,
       -0.8399127 , -0.14398257,  1.4419646 , -1.148511  , -1.4091693 ,
       -1.8123314 ,  2.943314  ,  2.8133228 ,  0.3327512 , -1.445299  ],
      dtype=float32)

The trained weights are in word2vec.wv.vectors, a 2D matrix of shape (number of words, dimensionality of word vectors).

In [34]:
weigths = word2vec.wv.vectors
weigths.shape

(4823, 20)

The mapping between the word indices in this matrix (integers) and the words themselves (strings) is in word2vec.wv.index_to_key.

In [35]:
# brings the original list of words / our vocabulary

word2vec.wv.index_to_key[11]

'очите'

### Train and Test Split

In [36]:
# Test proportion is approximately 15%

poems_test = corpus[0:85880]
len(poems_test)

85880

In [37]:
poems_train = corpus[85880:572620]
len(poems_train)

486740

In [38]:
test = corpus[0:50]

In [39]:
del corpus

### Vectorize data

Each poem has a different length. Unfortunately, a neural network does not understand the logic of speech: it needs every piece of text to have the same length, i.e. to be represented by a vector of the same shape.

This seemingly simple task turned out to be much more complicated on closer inspection.

My first approach was to take the word2vec vectors of the words in a text and average them, which gives one vector of fixed dimension per text. However, I have reservations about this approach: I don't see the logic in feeding the model something that is close to all the words but is not exactly any of them. So I decided to load the weights of the trained word2vec model into an Embedding layer, so that the model receives dense vectors.
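
The reservation about averaging can be seen on a tiny example. The vectors below are made up for illustration (they are not word2vec output), and the helper name average_vector is an assumption.

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for word2vec embeddings
toy_vectors = {
    "лунен": np.array([1.0, 0.0, 2.0]),
    "дъжд": np.array([3.0, 2.0, 0.0]),
}

def average_vector(words, vectors):
    # Stack the vectors of the words and take the mean along the word axis
    return np.mean([vectors[word] for word in words], axis=0)

text_vector = average_vector(["лунен", "дъжд"], toy_vectors)
print(text_vector)

# The average has a fixed shape but coincides with neither word vector
matches_any_word = any(np.array_equal(text_vector, vec) for vec in toy_vectors.values())
print(matches_any_word)
```

The result always has the embedding dimension, whatever the text length, but the word order and the individual words are lost, which is a poor fit for a model that has to generate the next word.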

I continued by writing functions for the preprocessing.

In [40]:
# Function to iterate over the poems' words and return their integer indices to feed the model

def make_dense_vectors(data):
  # A fresh list per call, so repeated calls do not accumulate indices
  vectors_list = []

  for word in data:
    # Words that are missing from the vocabulary map to 0, the unknown token
    vectors_list.append(vocab.get(word, 0))

  return vectors_list

In [41]:
# Function to make input chunks of data

def make_input_chunks(data):
  indices = make_dense_vectors(data)  # compute once, not on every iteration
  input_chunks_list = []

  for i in range(0, len(indices), chunk_size):
    input_chunks_list.append(indices[i: i + chunk_size])

  # Drop a trailing chunk that is shorter than chunk_size
  if input_chunks_list and len(input_chunks_list[-1]) < chunk_size:
    del input_chunks_list[-1]

  return input_chunks_list

In [42]:
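# Function to make target chunks of data

def make_target_chunks(data):
  indices = make_dense_vectors(data)  # compute once, not on every iteration
  target_chunks_list = []

  # Targets are the inputs shifted by one chunk, so start at chunk_size
  for i in range(chunk_size, len(indices), chunk_size):
    target_chunks_list.append(indices[i: i + chunk_size])

  # Drop a trailing chunk that is shorter than chunk_size
  if target_chunks_list and len(target_chunks_list[-1]) < chunk_size:
    del target_chunks_list[-1]

  # The last input chunk has no successor, so its target is all zeros (unknown token)
  target_chunks_list.append([0] * chunk_size)

  return target_chunks_list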
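
How the input and target chunks line up can be checked on a short sequence. This is a self-contained sketch with made-up token indices; the helper name chunk is hypothetical and only mirrors the idea of the two functions above.

```python
def chunk(sequence, chunk_size, offset=0):
    # Cut the sequence into consecutive chunks, dropping an incomplete last chunk
    chunks = [sequence[i:i + chunk_size]
              for i in range(offset, len(sequence), chunk_size)]
    return [c for c in chunks if len(c) == chunk_size]

token_ids = list(range(1, 13))  # twelve made-up token indices
size = 4

inputs = chunk(token_ids, size)
# Targets start one chunk later; the final input gets an all-zero target
targets = chunk(token_ids, size, offset=size) + [[0] * size]

for source, target in zip(inputs, targets):
    print(source, "->", target)
```

Each input chunk is paired with the chunk that follows it in the text, and both lists end up with the same number of chunks.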

In [43]:
chunk_size = 4

In [44]:
a = make_dense_vectors(test)

In [45]:
b = (list(a))

In [46]:
b.extend([0,0,0,0])

In [47]:
x = np.array(a, dtype = "float32")

In [48]:
y = np.array(b[4:], dtype = "float32")

In [49]:
x.shape, y.shape

((50,), (50,))

We need to one-hot encode the target chunks before they can be used to fit the model.
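
A small NumPy sketch of one-hot encoding, followed by a rough estimate of the memory it would take for the whole training split. The helper name one_hot is an assumption; the two sizes are the training-token count and vocabulary size reported earlier in this notebook.

```python
import numpy as np

def one_hot(ids, depth):
    # Row i of the identity matrix is the one-hot vector for index i
    return np.eye(depth, dtype="float32")[ids]

ids = np.array([2, 0, 3])
encoded = one_hot(ids, depth=5)
print(encoded.shape)
print(encoded[0])

# Memory needed to one-hot encode every training token as float32 (4 bytes each)
num_train_tokens, num_classes = 486740, 4823
gigabytes = num_train_tokens * num_classes * 4 / 1e9
print(round(gigabytes, 2), "GB")
```

At roughly 9.4 GB, encoding everything at once is impractical in a Colab session, which is one reason to encode batch by batch. Keras also offers the "sparse_categorical_crossentropy" loss, which takes integer targets directly and could avoid the one-hot step altogether.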

### Model creation using word2vec embeddings

We need to create a sequence-to-sequence model for next-token prediction. The architecture I chose is an encoder-decoder model, introduced by Google. The encoder processes the input sequence and transforms it into a fixed-size hidden representation. The decoder uses this hidden representation to generate the output sequence. The encoder-decoder structure allows input and output sequences of different lengths, which makes it well suited to sequential data.

1. An RNN layer acts as the "encoder": it processes the input sequence and returns its own internal state. Note that we discard the outputs of the encoder RNN and keep only the state. This state serves as the "context", or "conditioning", of the decoder in the next step.
2. Another RNN layer acts as the "decoder": it is trained to predict the next tokens of the target sequence, given the previous tokens of the target sequence. Specifically, it is trained to turn the target sequences into the same sequences offset by one timestep in the future, a training process called "teacher forcing" in this context. Importantly, the decoder uses the state vectors from the encoder as its initial state, which is how the decoder obtains information about what it is supposed to generate. Effectively, the decoder learns to generate targets[t+1...] given targets[...t], conditioned on the input sequence.
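
The one-step offset used in teacher forcing can be sketched in a few lines. The START index and the helper name teacher_forcing_pair are hypothetical; the notebook itself does not define a start token.

```python
START = 0  # hypothetical index reserved for a start-of-sequence token

def teacher_forcing_pair(target_ids):
    # Decoder input: the target shifted right by one step, beginning with START
    decoder_input = [START] + target_ids[:-1]
    # Decoder target: the sequence itself, i.e. one step ahead of the decoder input
    decoder_target = list(target_ids)
    return decoder_input, decoder_target

decoder_input, decoder_target = teacher_forcing_pair([17, 42, 8, 23])
print(decoder_input)
print(decoder_target)
```

At every position t the decoder sees the true token from step t - 1 and is asked to predict the token at step t, regardless of what it predicted earlier.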

In [50]:
def gensim_to_keras_embedding(model, train_embeddings=False):
    """Get a Keras 'Embedding' layer with weights set from Word2Vec model's learned word embeddings.

    Parameters
    ----------
    model : gensim.models.Word2Vec
        Trained Word2Vec model whose vectors initialize the layer.
    train_embeddings : bool
        If False, the returned weights are frozen and stopped from being updated.
        If True, the weights can / will be further updated in Keras.

    Returns
    -------
    "keras.layers.Embedding"
        Embedding layer, to be used as input to deeper network layers.

    """
    keyed_vectors = model.wv  # structure holding the result of training
    weights = keyed_vectors.vectors  # vectors themselves, a 2D numpy array
    index_to_key = keyed_vectors.index_to_key  # which row in `weights` corresponds to which word?

    layer = Embedding(
        input_dim = weights.shape[0],
        output_dim = weights.shape[1],
        weights = [weights],
        trainable = train_embeddings,
    )
    return layer
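
Because the layer takes its rows from the word2vec weight matrix, the integer fed to it for a word has to be that word's row in the same matrix (its position in index_to_key), not an index from a separately built vocabulary. A sketch of such a mapping with toy stand-ins for the gensim objects. Reserving row 0 for unknown words is an assumption of this sketch, not something the notebook does.

```python
import numpy as np

# Toy stand-ins for word2vec.wv.index_to_key and word2vec.wv.vectors
index_to_key = ["очите", "лунен", "дъжд"]
weights = np.arange(6, dtype="float32").reshape(3, 2)

# Prepend an all-zero row so that index 0 can mean "unknown word" (assumption)
unk_row = np.zeros((1, weights.shape[1]), dtype="float32")
weights_with_unk = np.vstack([unk_row, weights])
key_to_row = {word: row for row, word in enumerate(index_to_key, start=1)}

def encode(words):
    # Unknown words fall back to row 0
    return [key_to_row.get(word, 0) for word in words]

ids = encode(["лунен", "шлифер", "очите"])
print(ids)
print(weights_with_unk[ids])
```

With this layout the Embedding layer would need input_dim equal to the number of words plus one, and the looked-up row for "лунен" is exactly its original word2vec vector.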

In [51]:
weigths.shape

(4823, 20)

In [53]:
batch_size = 1  # Batch size for training
epochs = 10  # Number of epochs to train for
latent_dim = 128  # Latent dimensionality of the encoding space
num_decoder_tokens = vocab_size

In [None]:
# Dummy data with these shapes works with the model

# encoder_input_data = np.zeros((1000, 7), dtype="float32")
# decoder_input_data = np.zeros((1000, 16), dtype="float32")
# decoder_target_data = np.zeros((1000, 16, num_decoder_tokens), dtype="float32")

### Pipeline

The most efficient way is to create a pipeline for processing the data. It makes the workflow reusable and executes fast.

In [54]:
encoder_input_data = tf.data.Dataset.from_tensor_slices(x) \
  .batch(4, drop_remainder=True)

In [None]:
decoder_input_data = tf.data.Dataset.from_tensor_slices(y) \
  .batch(4, drop_remainder = True)

In [55]:
def one_hot_encode(vector):
  # Encode the element passed in by map(), not the global array "y";
  # the values are class indices, so cast them to integers first
  vector = tf.one_hot(tf.cast(vector, tf.int32), depth = num_decoder_tokens)
  return vector

In [56]:
decoder_target_data = tf.data.Dataset.from_tensor_slices(y) \
  .map(one_hot_encode) \
  .batch(4, drop_remainder = True)

In [None]:
encoder_input_data.element_spec, decoder_input_data.element_spec, decoder_target_data.element_spec

(TensorSpec(shape=(4,), dtype=tf.float32, name=None),
 TensorSpec(shape=(4,), dtype=tf.float32, name=None),
 TensorSpec(shape=(4, 4823), dtype=tf.float32, name=None))

Unfortunately, with this approach I keep getting the following error:

ValueError: Failed to find data adapter that can handle input: (<class 'list'> containing values of types {"<class 'tensorflow.python.data.ops.batch_op._BatchDataset'>"}), <class 'tensorflow.python.data.ops.batch_op._BatchDataset'>

I spent a huge amount of time debugging it without success.
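
The error message points at a likely cause: Keras received a Python list of separate Dataset objects, whereas model.fit() expects a single dataset whose elements are (inputs, targets) pairs. A hedged sketch of one way to build such a dataset with tf.data.Dataset.zip, using made-up index arrays in place of the real poems. All shapes and names here are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

num_tokens = 10   # hypothetical toy vocabulary size
timesteps = 4     # tokens per chunk, as with chunk_size = 4
num_samples = 12

# Made-up integer token ids of shape (samples, timesteps)
rng = np.random.default_rng(11)
encoder_ids = rng.integers(0, num_tokens, size=(num_samples, timesteps)).astype("int32")
decoder_ids = rng.integers(0, num_tokens, size=(num_samples, timesteps)).astype("int32")
target_ids = rng.integers(0, num_tokens, size=(num_samples, timesteps)).astype("int32")

encoder_ds = tf.data.Dataset.from_tensor_slices(encoder_ids)
decoder_ds = tf.data.Dataset.from_tensor_slices(decoder_ids)
# One-hot encode each target chunk: (timesteps,) -> (timesteps, num_tokens)
target_ds = tf.data.Dataset.from_tensor_slices(target_ids).map(
    lambda ids: tf.one_hot(ids, depth=num_tokens))

# Zip into ONE dataset that yields ((encoder_input, decoder_input), target)
train_ds = tf.data.Dataset.zip(((encoder_ds, decoder_ds), target_ds)) \
    .batch(4, drop_remainder=True)

(enc_batch, dec_batch), target_batch = next(iter(train_ds))
print(enc_batch.shape, dec_batch.shape, target_batch.shape)
```

A dataset of this structure could then be passed as model.fit(train_ds, epochs=epochs), without a separate target argument. Each input batch has shape (batch, timesteps), which matches Input(shape=(None,)), and each target batch has shape (batch, timesteps, vocabulary size), which matches the output of the Dense softmax layer.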

I also tried to make a generator and save smaller batches of the preprocessed data.

However, this last approach was not successful either: the functions I used are extremely slow and not something to keep in a project. They consume a lot of RAM and take forever to finish.

Below is an example of this "poor" code.

In [59]:
# Define an input sequence and process it.

encoder_inputs = Input(shape=(None,))
x = gensim_to_keras_embedding(word2vec)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim,
                           return_state=True)(x)
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)

# Define the model that will turn "encoder_input_data" and "decoder_input_data" into "decoder_target_data"
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
# Compile and run training
model.compile(optimizer=RMSprop(learning_rate=0.001), loss="categorical_crossentropy")

In [None]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 embedding (Embedding)       (None, None, 20)             96460     ['input_1[0][0]']             
                                                                                                  
 embedding_1 (Embedding)     (None, None, 128)            617344    ['input_2[0][0]']             
                                                                                              

In [None]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [None]:
# Note that "decoder_target_data" needs to be one-hot encoded, rather than sequences of integers like "decoder_input_data"!
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          # batch_size=batch_size,
          epochs=epochs,
          callbacks=[ModelCheckpoint("./checkpoints/"), tensorboard_callback],
          # validation_split=0.2,
          shuffle = False)

ValueError: Failed to find data adapter that can handle input: (<class 'list'> containing values of types {"<class 'tensorflow.python.data.ops.batch_op._BatchDataset'>"}), <class 'tensorflow.python.data.ops.batch_op._BatchDataset'>

In [None]:
os.makedirs("./checkpoints/", exist_ok=True)  # do not fail if the folder already exists

In [None]:
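# model.save() requires a target path; this file name is only a placeholder
model.save("./checkpoints/jaskier_model.keras")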

In [None]:
predictions = model.predict()

For every input token, the model returns a vector of probabilities over the vocabulary.

In [None]:
# Pick the most probable token along the last (vocabulary) axis
tf.argmax(predictions, axis = -1)
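
Turning such probabilities back into words can be sketched with NumPy. The probabilities and the toy vocabulary below are made up; the shape (batch, timesteps, vocabulary size) is what a decoder with return_sequences=True followed by a softmax Dense layer produces.

```python
import numpy as np

# Made-up probabilities of shape (batch, timesteps, vocabulary size) = (1, 3, 4)
predictions = np.array([[[0.1, 0.7, 0.1, 0.1],
                         [0.6, 0.2, 0.1, 0.1],
                         [0.1, 0.1, 0.2, 0.6]]])
index_to_word = ["<UNK>", "очите", "лунен", "дъжд"]  # hypothetical toy vocabulary

# The most probable token at each timestep lies along the last (vocabulary) axis
token_ids = predictions.argmax(axis=-1)
words = [[index_to_word[i] for i in row] for row in token_ids]
print(token_ids)
print(words)
```

Taking the argmax over the last axis gives one token index per timestep; taking it over axis 1 would instead compare timesteps with each other.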

TensorBoard can then display the logged training loss of the Keras model:

In [None]:
%load_ext tensorboard

%tensorboard --logdir logs

### References

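Articles by Tomas Mikolov et al.:

[1. Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)

[2. Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)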

[Word2Vec Documentation](https://radimrehurek.com/gensim/apiref.html)

[Piskvorky Git Repo](https://github.com/piskvorky/gensim/wiki/Using-Gensim-Embeddings-with-Keras-and-Tensorflow)

[Article "Sequence to Sequence Learning with Neural Networks"](https://arxiv.org/abs/1409.3215)

[Machine Learning Mastery](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)

[Medium Article](https://medium.com/@dilip.voleti/classification-using-word2vec-b1d79d375381)

[Keras Guide to Seq2Seq models](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)

[Seq2Seq model](https://cnvrg.io/seq2seq-model/)