# Text Generation using Bidirectional LSTM and Doc2Vec models


Text generated by an LSTM usually leaves a sense of something unfinished. The generated sentences look quite right, with correct grammar and syntax, as if the neural network had correctly understood the structure of a sentence. But the text as a whole does not make much sense, when it is not complete nonsense.

This problem could come from the approach itself, using only LSTM to generate text word by word. 

In this method we use an LSTM network to generate sequences of words. However, we try to go further than a classic LSTM neural network and use an additional neural network (an LSTM again) to select the best sentences.

The approach involves the following steps:
 1. **how to train a neural network to generate sentences** (i.e. sequences of words), based on existing speeches. We use a bidirectional LSTM architecture for that.
 2. **how to train a neural network to select the best next sentence for a given paragraph** (i.e. a sequence of sentences). We use a bidirectional LSTM architecture, in addition to a Doc2Vec model of the targeted speeches.


## 1. A Neural Network for Generating Sentences

The first step is to generate sentences in the style of a given personality.
LSTM (Long Short-Term Memory) networks are very good at analysing sequences of values and predicting the next values from them. For example, an LSTM could be a very good choice if we want to predict the next point of a given time series.

Sentences are basically sequences of words, so we can assume that an LSTM could be useful to generate the next word of a given sentence.


### 1.1.1. Process

In order to do that, we first build a dictionary containing all the words from the speeches we want to use. The steps are:

 1. read the data (the speeches we want to use),
 2. create the dictionary of words,
 3. create the list of sentences,
 4. create the neural network,
 5. train the neural network,
 6. generate new sentences.

In [1]:
from __future__ import print_function
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM, Input, Flatten, Bidirectional
from keras.layers.normalization import BatchNormalization
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.metrics import categorical_accuracy
import numpy as np
import random
import sys
import os
import time
import codecs
import collections
from six.moves import cPickle

Using TensorFlow backend.


We have raw text, and a lot has to be done before we can use it: splitting it into a list of words, etc.
To do that, I use the spaCy library, which is excellent for dealing with text. For this exercise, I will use only a few of spaCy's features.

In [2]:
#import spacy, and french model
#import spacy
#nlp = spacy.load('fr')
import en_core_web_sm
nlp = en_core_web_sm.load()

# parameters

In [3]:
#data_dir = 'data/Artistes_et_Phalanges-David_Campion'# data directory containing input.txt
save_dir = 'save' # directory to store models
seq_length = 30 # sequence length
sequences_step = 1 #step to create sequences

In [4]:
#file_list = ["101","102","103","104","105","106","107","108","109","110","111","112","201","202","203","204","205","206","207","208","209","210","211","212","213","214","301","302","303","304","305","306","307","308","309","310","311","312","313","314","401","402","403","404","405","406","407","408","409","410","411","412"]

vocab_file = "words_vocab.pkl"

# Read data

Create a list of words from the raw text. We use the spaCy library, with a helper function that keeps only the lowercase form of each word and removes carriage returns (\n).

We do this to reduce the number of distinct words in the dictionary, and we assume that we do not need to preserve capital letters.

In [5]:
def create_wordlist(doc):
    wl = []
    for word in doc:
        if word.text not in ("\n","\n\n",'\u2009','\xa0'):
            wl.append(word.text.lower())
    return wl
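
As a quick illustration of what this filter does, the sketch below applies the same logic to a few stand-in tokens. `Token` is a hypothetical placeholder for spaCy's token objects (only the `.text` attribute is mimicked) and the toy tokens are invented, so the example runs without loading a language model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the .text attribute is used.
Token = namedtuple("Token", ["text"])

# Tokens that are dropped: newlines, thin space, non-breaking space.
SKIPPED = ("\n", "\n\n", "\u2009", "\xa0")

def create_wordlist(doc):
    """Lowercase every token and drop newline / special-space tokens."""
    return [tok.text.lower() for tok in doc if tok.text not in SKIPPED]

toy_doc = [Token("Women"), Token("\n"), Token("Empowerment"), Token("\xa0"), Token(".")]
toy_words = create_wordlist(toy_doc)
print(toy_words)
```

Note that punctuation marks are kept as words of their own, which is why they also appear in the vocabulary.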

Create the list of words:

In [6]:
wordlist = []

input_file =  "speech.txt"
#read data
with codecs.open(input_file, "r") as f:
    data = f.read()
#create sentences
doc = nlp(data)
wl = create_wordlist(doc)
wordlist = wordlist + wl

## Create dictionary

The first step is to create the dictionary, that is, the list of all words contained in the texts. We assign an index to each word.

In [7]:
# count the number of words
word_counts = collections.Counter(wordlist)

# Mapping from index to word : that's the vocabulary
vocabulary_inv = [x[0] for x in word_counts.most_common()]
vocabulary_inv = list(sorted(vocabulary_inv))

# Mapping from word to index
vocab = {x: i for i, x in enumerate(vocabulary_inv)}
words = [x[0] for x in word_counts.most_common()]

#size of the vocabulary
vocab_size = len(words)
print("vocab size: ", vocab_size)

#save the words and vocabulary
with open(os.path.join(vocab_file), 'wb') as f:
    cPickle.dump((words, vocab, vocabulary_inv), f)

vocab size:  1142
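
To make the two mappings concrete, here is the same construction on a toy word list (the words are invented for illustration). `vocabulary_inv` maps an index to a word, and `vocab` maps a word back to its index.

```python
import collections

toy_wordlist = ["we", "thank", "the", "women", "and", "the", "men"]

# Count the occurrences of each distinct word
word_counts = collections.Counter(toy_wordlist)

# index -> word: the distinct words in alphabetical order
vocabulary_inv = sorted(word_counts)
# word -> index
vocab = {word: i for i, word in enumerate(vocabulary_inv)}

print(vocabulary_inv)
print(vocab["the"], vocabulary_inv[vocab["the"]])
```

Because the two mappings are inverses of each other, an index predicted by the network can always be turned back into a word.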


## create sequences
Now, we have to create the input data for our LSTM. We create two lists:
 - **sequences**: this list will contain the sequences of words used to train the model,
 - **next_words**: this list will contain the next word for each sequence of the **sequences** list.

We assume seq_length = 30.

So, to create the first sequence of words, we take the first 30 words of the **wordlist** list. The 31st word is the next word of this first sequence, and is added to the **next_words** list.

Then we move forward by a step of 1 in the list of words, to create the second sequence of words and retrieve the second "next word".

We iterate this task until the end of the list of words.

In [8]:
#create sequences
sequences = []
next_words = []
for i in range(0, len(wordlist) - seq_length, sequences_step):
    sequences.append(wordlist[i: i + seq_length])
    next_words.append(wordlist[i + seq_length])

print('nb sequences:', len(sequences))

nb sequences: 5480


When we iterate over the whole list of words, we create 5480 sequences of words, and retrieve, for each of them, the next word to be predicted.
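
The sliding window is easier to see on a short example. The sketch below runs the same loop on an invented eight-word list with a toy `seq_length` of 3 instead of 30.

```python
toy_wordlist = "we have to empower women and girls now".split()
seq_length = 3       # toy value; the notebook uses 30
sequences_step = 1

sequences = []
next_words = []
for i in range(0, len(toy_wordlist) - seq_length, sequences_step):
    # the window of seq_length words ...
    sequences.append(toy_wordlist[i: i + seq_length])
    # ... and the word that immediately follows it
    next_words.append(toy_wordlist[i + seq_length])

for seq, nxt in zip(sequences, next_words):
    print(seq, "->", nxt)
```

With a step of 1, a list of N words yields N - seq_length sequences, so the 5480 sequences above correspond to a word list of 5510 tokens.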

In [9]:
X = np.zeros((len(sequences), seq_length, vocab_size), dtype=bool)
y = np.zeros((len(sequences), vocab_size), dtype=bool)
for i, sentence in enumerate(sequences):
    for t, word in enumerate(sentence):
        X[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1
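
The cell above one-hot encodes the data: each word becomes a vector of length `vocab_size` with a single 1 at the word's index. A pure-Python sketch on a toy vocabulary (invented for illustration, with nested lists standing in for the NumPy arrays) shows the resulting structure:

```python
toy_vocab = {"and": 0, "girls": 1, "we": 2, "women": 3}
toy_sequences = [["we", "women"], ["women", "and"]]
toy_next_words = ["and", "girls"]

vocab_size = len(toy_vocab)

def one_hot(word):
    """Return a list of 0/1 flags with a single 1 at the word's index."""
    vec = [0] * vocab_size
    vec[toy_vocab[word]] = 1
    return vec

# X has shape (n_sequences, seq_length, vocab_size)
X = [[one_hot(w) for w in seq] for seq in toy_sequences]
# y has shape (n_sequences, vocab_size)
y = [one_hot(w) for w in toy_next_words]

print(X[0])
print(y[0])
```

For the real data, `X` holds 5480 × 30 × 1142 entries (about 188 million), which is one reason to store it with a boolean dtype.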

# Build Model

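We now create the neural network: a bidirectional LSTM layer, followed by a dropout layer and a softmax output layer over the vocabulary.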
 


In [10]:
def bidirectional_lstm_model(seq_length, vocab_size):
    print('Build LSTM model.')
    model = Sequential()
    model.add(Bidirectional(LSTM(rnn_size, activation="relu"),input_shape=(seq_length, vocab_size)))
    model.add(Dropout(0.6))
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))
    
    optimizer = Adam(lr=learning_rate)
    callbacks=[EarlyStopping(patience=2, monitor='val_loss')]
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[categorical_accuracy])
    return model

In [11]:
rnn_size = 256 # size of RNN
batch_size = 32 # minibatch size
seq_length = 30 # sequence length
num_epochs = 50 # number of epochs
learning_rate = 0.001 #learning rate
sequences_step = 1 #step to create sequences

In [12]:
md = bidirectional_lstm_model(seq_length, vocab_size)
md.summary()

Build LSTM model.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_1 (Bidirection (None, 512)               2865152   
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1142)              585846    
_________________________________________________________________
activation_1 (Activation)    (None, 1142)              0         
Total params: 3,450,998
Trainable params: 3,450,998
Non-trainable params: 0
_________________________________________________________________
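
The parameter counts in this summary can be checked by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias) and assumes biases are enabled in both the `LSTM` and `Dense` layers, as they are by default in Keras.

```python
rnn_size = 256
vocab_size = 1142

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * (rnn_size * (vocab_size + rnn_size) + rnn_size)
# The Bidirectional wrapper trains one LSTM per direction
bidirectional_params = 2 * lstm_params
# Dense layer: the concatenated 2 * rnn_size outputs feed vocab_size units, plus biases
dense_params = (2 * rnn_size) * vocab_size + vocab_size

total_params = bidirectional_params + dense_params
print(bidirectional_params, dense_params, total_params)
```

Most of the weights sit in the recurrent layer because its input is the one-hot vector of size 1142.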


## train data

Enough talk, let us train the model now. We hold out 50% of the data as a validation sample (`validation_split=0.5`) and shuffle the training samples at each epoch. We simply run:

In [13]:
#fit the model
callbacks=[EarlyStopping(patience=4, monitor='val_loss'),
           ModelCheckpoint('my_model_gen_sentences_lstm.{epoch:02d}-{val_loss:.2f}.hdf5',\
                           monitor='val_loss', verbose=0, mode='auto', period=2)]
history = md.fit(X, y,
                 batch_size=batch_size,
                 shuffle=True,
                 epochs=num_epochs,
                 callbacks=callbacks,
                 validation_split=0.5)

Train on 2740 samples, validate on 2740 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50

In [15]:
#save the model
md.save('my_model_gen_sentences_lstm.final.hdf5')

# Generate phrase

We generate phrases, word by word.

In [16]:
#load vocabulary
print("loading vocabulary...")
vocab_file = "words_vocab.pkl"

with open('words_vocab.pkl', 'rb') as f:
        words, vocab, vocabulary_inv = cPickle.load(f)

vocab_size = len(words)

loading vocabulary...


In [17]:
from keras.models import load_model
# load the model
print("loading model...")
model = load_model('my_model_gen_sentences_lstm.final.hdf5')

loading model...


To improve the word generation and tune the prediction a bit, we introduce a specific function to pick words.

We will not always take the word with the highest predicted probability (or the generated text would be boring); instead we introduce some uncertainty, and let the model sometimes pick a less likely word.

That is the purpose of the function **sample**, which randomly draws a word from the vocabulary.

The probability of a word being drawn depends directly on its predicted probability of being the next word. To tune this probability, we introduce a "temperature" to smooth or sharpen its value.

In [18]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
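
To see what the temperature does, the sketch below applies the same rescaling as `sample` to a made-up three-word distribution, using only the standard library. The helper name `rescale` and the probabilities are invented for illustration; the random draw itself is left out.

```python
import math

def rescale(preds, temperature=1.0):
    """Apply the temperature rescaling used by sample() and renormalise."""
    logits = [math.log(p) / temperature for p in preds]
    exps = [math.exp(value) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_preds = [0.5, 0.3, 0.2]       # invented probabilities for three words

sharp = rescale(toy_preds, 0.34)  # temperature < 1: favours the likeliest word
same = rescale(toy_preds, 1.0)    # temperature = 1: distribution unchanged
flat = rescale(toy_preds, 3.0)    # temperature > 1: closer to uniform

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in same])
print([round(p, 3) for p in flat])
```

A temperature below 1 concentrates the probability mass on the most likely words, while a temperature above 1 gives rarer words a better chance of being drawn.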

In [26]:
#initiate sentences
seed_sentences = "women empowerment"
generated = ''
sentence = []
for i in range (seq_length):
    sentence.append("a")

seed = seed_sentences.split()

for i in range(len(seed)):
    sentence[seq_length-i-1]=seed[len(seed)-i-1]

generated += ' '.join(sentence)
print('Generating text with the following seed: "' + ' '.join(sentence) + '"')

print ()

Generating text with the following seed: "a a a a a a a a a a a a a a a a a a a a a a a a a a a a women empowerment"



In [27]:
words_number = 100
#generate the text
for i in range(words_number):
    #create the vector
    x = np.zeros((1, seq_length, vocab_size))
    for t, word in enumerate(sentence):
        x[0, t, vocab[word]] = 1.
    #print(x.shape)

    #calculate next word
    preds = model.predict(x, verbose=0)[0]
    next_index = sample(preds, 0.34)
    next_word = vocabulary_inv[next_index]

    #add the next word to the text
    generated += " " + next_word
    # shift the sentence by one, and add the next word at its end
    sentence = sentence[1:] + [next_word]

print(generated)


a a a a a a a a a a a a a a a a a a a a a a a a a a a a women empowerment we , is we , , and of , us the the in . . , , . , we , . to . , the we , the it the the . and , we to , we we we we , . , to we the the , , and and we . we of done we . we the to , the the , that , we . . are thank of . , to women the . we . and have in the . and , , in . . the to the and we .


# Now we have to vectorize all sentences in the text, and try to find patterns in sequences of these vectors.

In order to do that, we will use Doc2Vec.

# 1. Doc2Vec
Doc2Vec is able to vectorize a paragraph of text. We will transform each sentence of our text into a vector in a specific space. Doing so, we will be able to compare them and retrieve the sentence most similar to a given one.
So, once all sentences have been converted to vectors, we will **train a new bidirectional LSTM**. Its purpose will be to predict the best vector to follow a sequence of vectors.
We will then generate sentences as candidates for the next sentence. We will infer their vectors using the **trained Doc2Vec model**, then pick the one closest to the prediction of our new LSTM model.

## 1.1 Create the Doc2Vec Model
The first task is to create our **doc2vec model**, dedicated to our text and embedded sentences.


In [28]:
#import gensim library
import gensim
from gensim.models.doc2vec import LabeledSentence

import numpy as np
import os
import time
import codecs



In [30]:

import en_core_web_sm
nlp = en_core_web_sm.load()

#initiate sentences and labels lists
sentences = []
sentences_label = []

#create sentences function:
def create_sentences(doc):
    ponctuation = [".","?","!",":","…"]
    sentences = []
    sent = []
    for word in doc:
        if word.text not in ponctuation:
            if word.text not in ("\n","\n\n",'\u2009','\xa0'):
                sent.append(word.text.lower())
        else:
            sent.append(word.text.lower())
            if len(sent) > 1:
                sentences.append(sent)
            sent=[]
    return sentences

#create sentences from files
input_file="speech.txt"
with codecs.open(input_file, "r") as f:
    data = f.read()
#create sentences
doc = nlp(data)
sents = create_sentences(doc)
sentences = sentences + sents
    
#create labels
for i in range(len(sentences)):
    sentences_label.append("ID" + str(i))

In [31]:
class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield gensim.models.doc2vec.LabeledSentence(doc,[self.labels_list[idx]])

In [32]:
save_dir = 'save'
def train_doc2vec_model(data, docLabels, size=300, sample=0.000001, dm=0, hs=1, window=10, min_count=0, workers=8,alpha=0.024, min_alpha=0.024, epoch=15, save_file='./data/doc2vec.w2v') :
    startime = time.time()
    
    print("{0} articles loaded for model".format(len(data)))

    it = LabeledLineSentence(data, docLabels)

    model = gensim.models.Doc2Vec(size=size, sample=sample, dm=dm, window=window, min_count=min_count, workers=workers,alpha=alpha, min_alpha=min_alpha, hs=hs) # use fixed learning rate
    model.build_vocab(it)
    for epoch in range(epoch):
        print("Training epoch {}".format(epoch + 1))
        model.train(it,total_examples=model.corpus_count,epochs=model.iter)
    
        
    #saving the created model
    model.save(os.path.join(save_file))
    print('model saved')

In [33]:
train_doc2vec_model(sentences, sentences_label, size=500,sample=0.0,alpha=0.025, min_alpha=0.001, min_count=0, window=10, epoch=20, dm=0, hs=1, save_file='doc2vec.w2v')

216 articles loaded for model


Training epoch 1


Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
model saved


In [36]:
#import library
from six.moves import cPickle

#load the model
d2v_model = gensim.models.doc2vec.Doc2Vec.load('doc2vec.w2v')

sentences_vector=[]

t = 500

for i in range(len(sentences)):
    if i % t == 0:
        print("sentence", i, ":", sentences[i])
        print("***")
    sent = sentences[i]
    sentences_vector.append(d2v_model.infer_vector(sent, alpha=0.001, min_alpha=0.001, steps=10000))
    
#save the sentences_vector
sentences_vector_file =  "sentences_vector_500_a001_ma001_s10000.pkl"
with open(sentences_vector_file, 'wb') as f:
    cPickle.dump((sentences_vector), f)

sentence 0 : ['i', 'want', 'to', 'start', 'by', 'appreciating', 'my', 'sisters', 'and', 'brothers', 'here', 'with', 'today', ';', 'h.e.', 'ms.', 'otiko', 'afisa', 'djaba', ',', 'minister', 'for', 'gender', ',', 'children', 'and', 'social', 'protection', ',', 'ghana', 'is', 'here', ',', 'because', 'ghana', 'in', 'the', 'african', 'union', 'is', 'the', 'champion', 'for', 'gender', 'and', 'development', '.']
***


In [37]:
nb_sequenced_sentences = 15
vector_dim = 500

X_train = np.zeros((len(sentences), nb_sequenced_sentences, vector_dim), dtype=float)
y_train = np.zeros((len(sentences), vector_dim), dtype=float)

t = 1000
for i in range(len(sentences_label)-nb_sequenced_sentences-1):
    if i % t == 0: print("new sequence: ", i)
    
    for k in range(nb_sequenced_sentences):
        sent = sentences_label[i+k]
        vect = sentences_vector[i+k]
        
        if i % t == 0:
            print("  ", k + 1 ,"th vector for this sequence. Sentence ", sent, "(vector dim = ", len(vect), ")")
            
        for j in range(len(vect)):
            X_train[i, k, j] = vect[j]
    
    senty = sentences_label[i+nb_sequenced_sentences]
    vecty = sentences_vector[i+nb_sequenced_sentences]
    if i % t == 0: print("  y vector for this sequence ", senty, ": (vector dim = ", len(vecty), ")")
    for j in range(len(vecty)):
        y_train[i, j] = vecty[j]

print(X_train.shape, y_train.shape)

new sequence:  0
   1 th vector for this sequence. Sentence  ID0 (vector dim =  500 )
   2 th vector for this sequence. Sentence  ID1 (vector dim =  500 )
   3 th vector for this sequence. Sentence  ID2 (vector dim =  500 )
   4 th vector for this sequence. Sentence  ID3 (vector dim =  500 )
   5 th vector for this sequence. Sentence  ID4 (vector dim =  500 )
   6 th vector for this sequence. Sentence  ID5 (vector dim =  500 )
   7 th vector for this sequence. Sentence  ID6 (vector dim =  500 )
   8 th vector for this sequence. Sentence  ID7 (vector dim =  500 )
   9 th vector for this sequence. Sentence  ID8 (vector dim =  500 )
   10 th vector for this sequence. Sentence  ID9 (vector dim =  500 )
   11 th vector for this sequence. Sentence  ID10 (vector dim =  500 )
   12 th vector for this sequence. Sentence  ID11 (vector dim =  500 )
   13 th vector for this sequence. Sentence  ID12 (vector dim =  500 )
   14 th vector for this sequence. Sentence  ID13 (vector dim =  500 )
   15 th

# 3. Create the Keras Model

- bidirectional LSTM,
- with a size of 512 and ReLU as activation,
- then a dropout layer of 0.5.

### We create the model and run it to predict the best vectorized sentence following a sequence of 15 sentences

## Our text generation process will be:
We first have to provide a seed of 15 sentences that contains at least 30 words. Then:
 1. using the last 30 words of the seed, we generate 10 candidate sentences,
 2. we infer their vectors using the doc2vec model,
 3. we calculate the "best vector" for the sentence following the 15 sentences of the seed,
 4. we compare the inferred vectors with the "best vector", and pick the closest one,
 5. we add the generated sentence corresponding to this vector at the end of the seed, as the next sentence of the text,
 6. then, we loop over the process.
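
Steps 4 and 5 can be sketched as follows. The text does not say which distance is used to find the "closest" vector, so cosine similarity is an assumption here; the function names, candidate sentences and 3-dimensional vectors are also invented for illustration (the Doc2Vec vectors in this notebook have 500 dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_next_sentence(candidates, candidate_vectors, best_vector):
    """Return the candidate whose vector is most similar to best_vector."""
    scores = [cosine_similarity(vec, best_vector) for vec in candidate_vectors]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index]

# Toy candidates, as they would come out of the word-level generator
candidates = [["we", "thank", "you", "."],
              ["women", "lead", "."],
              ["the", "union", "is", "here", "."]]
# Invented vectors standing in for the inferred Doc2Vec vectors
candidate_vectors = [[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.2],
                     [0.5, 0.5, 0.0]]
# Stands in for the prediction of the sentence-level LSTM
best_vector = [0.1, 0.9, 0.1]

chosen = pick_next_sentence(candidates, candidate_vectors, best_vector)
print(" ".join(chosen))
```

The chosen sentence is then appended to the seed, and the last 15 sentences become the input for the next iteration.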