<a href="https://colab.research.google.com/github/nguyetvo/Nguyet-ML2-Online-042020/blob/master/2_select_next_sentence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation using Bidirectional LSTM and Doc2Vec models 2/3

If you landed directly on this page, I suggest reading the first part of this article first. It describes how to create an RNN model that generates text word after word.

I ended the first part by explaining that I would try to improve sentence generation by detecting patterns in sequences of sentences, not only in sequences of words.

This could be an improvement because the context of a paragraph (is it a description of the countryside? a dialogue between characters? which people are involved? what happened before?) could emerge and be used to select the next sentence of the text wisely.

The process is similar to the previous one; however, I have to vectorize every sentence in the text and try to find patterns in sequences of these vectors.

To do that, we will use **Doc2Vec**.

# 1. Doc2Vec
Doc2Vec can vectorize a paragraph of text. If you do not know it, I suggest having a look at the gensim website, which describes how it works and what you can do with it.

In short, we will transform each sentence of our text into a vector in a specific space. The great thing about this approach is that we can compare the vectors, for example to retrieve the sentence most similar to a given one.

Last but not least, all vectors have the same dimension, whatever the number of words in the corresponding sentence.

This is exactly what we are looking for: I will be able to train a new LSTM that tries to catch patterns in sequences of vectors of the same dimension.

``I have to be honest: I am not sure we can perform such a task with enough accuracy, but let’s run some tests. It is an experiment; at worst, it will be a good exercise.``

So, once all sentences have been converted to vectors, we will **train a new bidirectional LSTM**. Its purpose will be to predict the best next vector for a given sequence of vectors.

Then how will we generate text?

Pretty easy: thanks to our previous LSTM model, we will generate several sentences as candidates for the next sentence. We will infer their vectors using the **trained doc2vec model**, then pick the one closest to the prediction of our new LSTM model.
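
The selection step can be sketched in a few lines. This is only an illustration: it assumes that "closest" means highest cosine similarity (the text does not fix the measure), and the names `cosine_similarity`, `pick_closest` and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def pick_closest(predicted, candidates):
    """Return the index of the candidate vector most similar to `predicted`."""
    scores = [cosine_similarity(predicted, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Toy 3-dimensional vectors standing in for 500-dimensional doc2vec vectors.
predicted_vector = [1.0, 0.0, 1.0]
candidate_vectors = [
    [0.0, 1.0, 0.0],    # orthogonal to the prediction
    [2.0, 0.1, 1.9],    # almost the same direction
    [-1.0, 0.0, -1.0],  # opposite direction
]
best = pick_closest(predicted_vector, candidate_vectors)
print(best)
```

In the real pipeline, `predicted_vector` would come from the second LSTM and `candidate_vectors` from inferring the vectors of the generated candidate sentences.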

## 1.1 Create the Doc2Vec Model
The first task is to create our **doc2vec model**, trained on our text to embed its sentences.

**Doc2Vec** expects its input to be, for each sentence, a list of words together with a label:

``Example: ['tobus', 'ouvre', 'la', 'porte', '.'] LABEL1``
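
As a rough picture of that input format, here is a sketch that uses a plain `namedtuple` as a stand-in for gensim's tagged-document type, which has the same two fields (`words` and `tags`). The name `LabeledExample` and the second toy sentence are made up for the illustration.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags).
LabeledExample = namedtuple("LabeledExample", ["words", "tags"])

toy_sentences = [
    ["tobus", "ouvre", "la", "porte", "."],
    ["il", "entre", "."],  # invented sentence, only for the example
]

# One labeled example per sentence, with labels ID0, ID1, ...
labeled = [
    LabeledExample(words, ["ID" + str(i)])
    for i, words in enumerate(toy_sentences)
]

for doc in labeled:
    print(doc.tags, doc.words)
```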

So we have to extract each sentence from the text and split it into words.

By convention, I assume a sentence ends with “.”, “?”, “!”, “:” or “…”. The script reads each text and creates a new sentence each time it reaches one of these characters.
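
To see this convention in isolation, here is a minimal sketch that works on an already tokenized list of strings rather than on a spaCy document. The function name `split_sentences` and the toy tokens are assumptions for the example, and whitespace tokens are not filtered here.

```python
SENTENCE_END = {".", "?", "!", ":", "…"}

def split_sentences(tokens):
    """Group tokens into sentences, closing one at each end-of-sentence mark."""
    sentences, current = [], []
    for token in tokens:
        current.append(token.lower())
        if token in SENTENCE_END:
            if len(current) > 1:  # skip a sentence made of a lone punctuation mark
                sentences.append(current)
            current = []
    # tokens after the last end mark are dropped, since no sentence was closed
    return sentences

tokens = ["Tobus", "ouvre", "la", "porte", ".", "Qui", "est", "là", "?", "!"]
result = split_sentences(tokens)
print(result)
```

The final "!" is discarded because it would form a sentence containing only a punctuation mark.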

First, we import the gensim Doc2Vec library, load our data and set some parameters:

- all texts are stored in the **data_dir** directory,
- the **file_list** list contains the names of all text files in the **data_dir** directory,
- the **save_dir** will be used to save models.


In [0]:
#import gensim library
import gensim
# TaggedDocument replaces LabeledSentence, which was deprecated and then removed in gensim 4
from gensim.models.doc2vec import TaggedDocument

import numpy as np
import os
import time
import codecs

#parameters
data_dir = 'data/Artistes_et_Phalanges-David_Campion' # directory containing the input text files
save_dir = 'save' # directory to store models
file_list = ["101","102","103","104","105","106","107","108","109","110","111","112","201","202","203","204","205","206","207","208","209","210","211","212","213","214","301","302","303","304","305","306","307","308","309","310","311","312","313","314","401","402","403","404","405","406","407","408","409","410","411","412"]




I create the list of sentences for the doc2vec model: to split sentences easily, I use the **spaCy** library. Then I create a list of labels for these sentences.

In [0]:

#import spacy, and french model
#(spaCy 3+ removed shortcut names such as 'fr'; there, load 'fr_core_news_sm' instead)
import spacy
nlp = spacy.load('fr')

#initiate sentences and labels lists
sentences = []
sentences_label = []

#create sentences function:
def create_sentences(doc):
    ponctuation = [".","?","!",":","…"]
    sentences = []
    sent = []
    for word in doc:
        if word.text not in ponctuation:
            if word.text not in ("\n","\n\n",'\u2009','\xa0'):
                sent.append(word.text.lower())
        else:
            sent.append(word.text.lower())
            if len(sent) > 1:
                sentences.append(sent)
            sent=[]
    return sentences

#create sentences from files
for file_name in file_list:
    input_file = os.path.join(data_dir, file_name + ".txt")
    #read data
    with codecs.open(input_file, "r") as f:
        data = f.read()
    #create sentences
    doc = nlp(data)
    sents = create_sentences(doc)
    sentences = sentences + sents
    
#create labels
#(len() avoids building a NumPy array from sentences of different lengths)
for i in range(len(sentences)):
    sentences_label.append("ID" + str(i))

## 1.2 Train doc2vec model
As explained above, **doc2vec** requires its input to be correctly shaped. To provide that, we define a specific class:

In [0]:
class LabeledLineSentence(object):
    """Iterable that yields one tagged document per sentence."""
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            # TaggedDocument is the current name of the deprecated LabeledSentence
            yield gensim.models.doc2vec.TaggedDocument(doc, [self.labels_list[idx]])

I also create a function to train the doc2vec model. Its purpose is to make the training parameters easy to update:

In [0]:
def train_doc2vec_model(data, docLabels, size=300, sample=0.000001, dm=0, hs=1, window=10, min_count=0, workers=8, alpha=0.024, min_alpha=0.024, epoch=15, save_file='./data/doc2vec.w2v'):
    startime = time.time()
    
    print("{0} articles loaded for model".format(len(data)))

    it = LabeledLineSentence(data, docLabels)

    # alpha == min_alpha gives a fixed learning rate; pass different values to let it decay
    # (gensim 4 renamed `size` to `vector_size` and `model.iter` to `model.epochs`)
    model = gensim.models.Doc2Vec(size=size, sample=sample, dm=dm, window=window, min_count=min_count, workers=workers, alpha=alpha, min_alpha=min_alpha, hs=hs)
    model.build_vocab(it)
    # the loop variable no longer shadows the `epoch` parameter
    for epoch_idx in range(epoch):
        print("Training epoch {}".format(epoch_idx + 1))
        # each call makes model.iter passes over the corpus
        model.train(it, total_examples=model.corpus_count, epochs=model.iter)
        # model.alpha -= 0.002 # decrease the learning rate
        # model.min_alpha = model.alpha # fix the learning rate, no decay
        
    #saving the created model
    model.save(save_file)
    print('model saved')

A few notes regarding the parameters of the function: the default values were chosen empirically.

Now, it's time to train the **doc2vec model**. Simply run the command:


In [0]:
train_doc2vec_model(sentences, sentences_label, size=500,sample=0.0,alpha=0.025, min_alpha=0.001, min_count=0, window=10, epoch=20, dm=0, hs=1, save_file='./data/doc2vec.w2v')

12273 articles loaded for model
Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
model saved


Here are some insights into the parameters used:

- **dimensions**: 300 dimensions seem to work well for classic subjects. In my case, after a few tests, I preferred 500 dimensions.
- **epochs**: below 10 epochs, results are not good enough (similarities do not work well), and a larger number of epochs creates too many similar vectors. So I chose 20 epochs for the training.
- **min_count**: I want to include all words in the training, even those with very few occurrences. Indeed, I assume that, for my exercise, specific words could be important. I set the value to 0, but 3 to 5 should be OK.
- **sample**: *0.0*. I do not want to randomly downsample higher-frequency words, so I disabled it.
- **hs and dm**: Each time I infer a vector from the trained model for a given sentence, I want to get the same output vector. To achieve that (strangely, it is not so intuitive), I need to use distributed bag of words as the *training algorithm (dm=0)* and *hierarchical softmax (hs=1)*. Indeed, for my purpose, distributed memory and negative sampling seem to give worse results.

# 2. Create the Input Dataset
First, using my trained **doc2vec** model, I infer the vector of every sentence of my texts. The **doc2vec** model provides the vector of each sentence directly; we just have to iterate over the whole list of sentences:

In [0]:
#import library
from six.moves import cPickle

#load the model
d2v_model = gensim.models.doc2vec.Doc2Vec.load('./data/doc2vec.w2v')

sentences_vector=[]

t = 500

for i in range(len(sentences)):
    if i % t == 0:
        print("sentence", i, ":", sentences[i])
        print("***")
    sent = sentences[i]
    # gensim 4 renamed the `steps` argument to `epochs`
    sentences_vector.append(d2v_model.infer_vector(sent, alpha=0.001, min_alpha=0.001, steps=10000))
    
#save the sentences_vector
sentences_vector_file = os.path.join(save_dir, "sentences_vector_500_a001_ma001_s10000.pkl")
with open(sentences_vector_file, 'wb') as f:
    cPickle.dump(sentences_vector, f)

sentence 0 : ['—', 'non', ',', 'celui-là', 'n’', 'y', 'parvient', 'pas', 'non', 'plus', '!']
***
sentence 500 : ['ce', 'bouclier', 'écarte', 'et', 'détourne', 'les', 'eaux', 'du', 'fleuve', ',', 'n’', 'en', 'laissant', 'filtrer', 'qu’', 'une', 'infime', 'partie', 'pour', 'l’', 'usage', 'des', 'citadins', '.']
***
sentence 1000 : ['essayez', 'de', 'distinguer', 'les', 'fils', 'du', 'temps', '.']
***
sentence 1500 : ['le', 'plafond', 'est', 'couvert', 'de', 'chevaux', 'au', 'galop', 'directement', 'peints', 'sur', 'la', 'pierre', '.']
***
sentence 2000 : ['—', 'je', 'ne', 'suis', 'pas', 'le', 'penangis', '…']
***
sentence 2500 : ['ce', 'que', 'je', 'souhaite', ',', 'c’', 'est', 'que', 'vous', 'soyez', 'prêts', 'au', 'moment', 'voulu', '.']
***
sentence 3000 : ['le', 'jeune', 'homme', 'hausse', 'les', 'épaules', 'et', 'se', 'dirige', 'vers', 'la', 'sortie', '.']
***
sentence 3500 : ['silvi', 'attrape', 'le', 'premier', 'livre', 'qui', 'lui', 'tombe', 'sous', 'la', 'main', '.']
***
...

Note: I do not use the vectors generated during training, because I want to compare them with vectors inferred for sentences the model has not seen. It is better to generate all of them in the same way.

Now, to create the Keras input data set **(X_train, y_train)**, we have to follow these guidelines:

- 15 consecutive vectors from doc2vec as input,
- the next vector (the 16th) as output.

So the dimension of X_train must be **(number of sequences, 15, 500)** and the dimension of y_train **(number of sequences, 500)**.
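
The windowing described above can be sketched on toy data with plain Python lists. The helper name `make_sequences` and the tiny dimensions (windows of 3 vectors of size 2 instead of 15 vectors of size 500) are assumptions made to keep the example readable.

```python
def make_sequences(vectors, seq_length):
    """Build (inputs, targets): each input is `seq_length` consecutive
    vectors and its target is the vector that follows them."""
    inputs, targets = [], []
    for start in range(len(vectors) - seq_length):
        inputs.append(vectors[start:start + seq_length])
        targets.append(vectors[start + seq_length])
    return inputs, targets

# Toy data: 6 "sentence vectors" of dimension 2, windows of 3 vectors.
toy_vectors = [[float(i), float(-i)] for i in range(6)]
X_toy, y_toy = make_sequences(toy_vectors, seq_length=3)

# (number of sequences, window length, vector dimension)
print(len(X_toy), len(X_toy[0]), len(X_toy[0][0]))
# (number of sequences, vector dimension)
print(len(y_toy), len(y_toy[0]))
```

With N sentence vectors and windows of 15, this scheme yields N - 15 complete sequences.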

In [0]:
nb_sequenced_sentences = 15
vector_dim = 500

# np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
X_train = np.zeros((len(sentences), nb_sequenced_sentences, vector_dim), dtype=np.float64)
y_train = np.zeros((len(sentences), vector_dim), dtype=np.float64)
# note: the arrays have one row per sentence, so the last rows, which have no
# complete sequence, are never filled and stay at zero

t = 1000
for i in range(len(sentences_label)-nb_sequenced_sentences-1):
    if i % t == 0: print("new sequence: ", i)
    
    for k in range(nb_sequenced_sentences):
        sent = sentences_label[i+k]
        vect = sentences_vector[i+k]
        
        if i % t == 0:
            print("  ", k + 1 ,"th vector for this sequence. Sentence ", sent, "(vector dim = ", len(vect), ")")
            
        # copy the whole vector into position k of sequence i
        X_train[i, k, :] = vect
    
    senty = sentences_label[i+nb_sequenced_sentences]
    vecty = sentences_vector[i+nb_sequenced_sentences]
    if i % t == 0: print("  y vector for this sequence ", senty, ": (vector dim = ", len(vecty), ")")
    # the target is the vector that follows the sequence
    y_train[i, :] = vecty

print(X_train.shape, y_train.shape)

new sequence:  0
   1 th vector for this sequence. Sentence  ID0 (vector dim =  500 )
   2 th vector for this sequence. Sentence  ID1 (vector dim =  500 )
   3 th vector for this sequence. Sentence  ID2 (vector dim =  500 )
   4 th vector for this sequence. Sentence  ID3 (vector dim =  500 )
   5 th vector for this sequence. Sentence  ID4 (vector dim =  500 )
   6 th vector for this sequence. Sentence  ID5 (vector dim =  500 )
   7 th vector for this sequence. Sentence  ID6 (vector dim =  500 )
   8 th vector for this sequence. Sentence  ID7 (vector dim =  500 )
   9 th vector for this sequence. Sentence  ID8 (vector dim =  500 )
   10 th vector for this sequence. Sentence  ID9 (vector dim =  500 )
   11 th vector for this sequence. Sentence  ID10 (vector dim =  500 )
   12 th vector for this sequence. Sentence  ID11 (vector dim =  500 )
   13 th vector for this sequence. Sentence  ID12 (vector dim =  500 )
   14 th vector for this sequence. Sentence  ID13 (vector dim =  500 )
   ...

   11 th vector for this sequence. Sentence  ID7010 (vector dim =  500 )
   12 th vector for this sequence. Sentence  ID7011 (vector dim =  500 )
   13 th vector for this sequence. Sentence  ID7012 (vector dim =  500 )
   14 th vector for this sequence. Sentence  ID7013 (vector dim =  500 )
   15 th vector for this sequence. Sentence  ID7014 (vector dim =  500 )
  y vector for this sequence  ID7015 : (vector dim =  500 )
new sequence:  8000
   1 th vector for this sequence. Sentence  ID8000 (vector dim =  500 )
   2 th vector for this sequence. Sentence  ID8001 (vector dim =  500 )
   3 th vector for this sequence. Sentence  ID8002 (vector dim =  500 )
   4 th vector for this sequence. Sentence  ID8003 (vector dim =  500 )
   5 th vector for this sequence. Sentence  ID8004 (vector dim =  500 )
   6 th vector for this sequence. Sentence  ID8005 (vector dim =  500 )
   7 th vector for this sequence. Sentence  ID8006 (vector dim =  500 )
   8 th vector for this sequence. Sentence  ID8007 

# 3. Create the Keras Model

Great, let's create the model now…

First, we load the library and create a function that defines a simple Keras model:

- a bidirectional LSTM,
- with a size of 512 and ReLU as activation (very small, but quicker for running the test),
- then a dropout layer of 0.5,
- the network will not give me a probability but directly the next vector for a given sequence, so I finish it with a simple dense layer whose size is the vector dimension.

I use Adam as optimizer and the loss is computed using **logcosh**.
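
For readers unfamiliar with this loss, here is a small worked sketch in pure Python. It assumes the usual definition, the mean of log(cosh(prediction - target)) over the vector components; the function names are made up for the example.

```python
import math

def logcosh(error):
    """log(cosh(error)), written in a numerically stable form:
    log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)."""
    a = abs(error)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def logcosh_loss(y_true, y_pred):
    """Mean log-cosh over the components of one (target, prediction) pair."""
    return sum(logcosh(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

small = logcosh(0.01)  # behaves like error**2 / 2 for small errors
large = logcosh(10.0)  # behaves like |error| - log(2) for large errors
print(small, 0.01 ** 2 / 2)
print(large, 10.0 - math.log(2.0))
print(logcosh_loss([0.0, 1.0], [0.5, 1.0]))
```

The loss is therefore close to a squared error for small deviations and close to an absolute error for large ones, which makes it less sensitive to occasional badly predicted components.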

In [0]:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Dropout, Bidirectional, LSTM
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.optimizers import Adam

def bidirectional_lstm_model(seq_length, vector_dim):
    # rnn_size and learning_rate are read from the notebook's global variables
    print('Building LSTM model...')
    model = Sequential()
    model.add(Bidirectional(LSTM(rnn_size, activation="relu"),input_shape=(seq_length, vector_dim)))
    model.add(Dropout(0.5))
    # linear output layer: the network predicts a vector, not a probability
    model.add(Dense(vector_dim))
    
    # newer Keras versions name this argument learning_rate instead of lr
    optimizer = Adam(lr=learning_rate)
    model.compile(loss='logcosh', optimizer=optimizer, metrics=['acc'])
    print('LSTM model built.')
    return model

Then we create the model:

In [0]:
rnn_size = 512 # size of RNN
vector_dim = 500
learning_rate = 0.0001 #learning rate

model_sequence = bidirectional_lstm_model(nb_sequenced_sentences, vector_dim)

Building LSTM model...
LSTM model built.


And we train it:

In [0]:
batch_size = 30 # minibatch size

callbacks=[EarlyStopping(patience=3, monitor='val_loss'),
           ModelCheckpoint(filepath=save_dir + "/" + 'my_model_sequence_lstm.{epoch:02d}.hdf5',\
                           monitor='val_loss', verbose=1, mode='auto', period=5)]

history = model_sequence.fit(X_train, y_train,
                 batch_size=batch_size,
                 shuffle=True,
                 epochs=40,
                 callbacks=callbacks,
                 validation_split=0.1)

#save the model
model_sequence.save(save_dir + "/" + 'my_model_sequence_lstm.final2.hdf5')

Train on 11045 samples, validate on 1228 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40


Great! After a few hours of training, we have a model that predicts the next best sentence vector for a given sequence of sentences.

A few remarks regarding the results:

- the loss drops to 0.1049 and the accuracy is around 14%,
- the val_loss is around 0.1064, with a validation accuracy around 16%.
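
The sample counts shown in the training log ("Train on 11045 samples, validate on 1228 samples") follow from `validation_split=0.1` applied to the 12273 rows of X_train. A minimal sketch of the arithmetic, assuming the split index is obtained by truncating to an integer (the helper name is hypothetical):

```python
def validation_split_sizes(n_samples, validation_split):
    """Sizes of the training and validation parts when the last
    `validation_split` fraction of the samples is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

n_train, n_val = validation_split_sizes(12273, 0.1)
print(n_train, n_val)
```

Keras takes the validation samples from the end of the arrays, before any shuffling, so here the validation set is made of the last sequences of the text.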

# 4. Conclusion
As you probably noticed, the raw result of the neural network trained during this tutorial is not "amazing"… We can probably do better.

However, let's check whether the result is good enough to select the best next sentence of a text. I hope it will be fair enough for my test; indeed, for a given sequence of sentences, there is no clear determinism in which sentence should come next.

To test that, for a given sequence of sentences, we have to:

- generate different candidate sentences using our **first LSTM model**,
- infer their vectors using our **doc2vec model**,
- generate the best following vector for the sequence using our **second LSTM model**,
- then select the candidate whose vector is the most similar.

That’s what I’ll try to do in the next part of this experiment…

Thanks for reading!