<a href="https://colab.research.google.com/github/nguyetvo/Nguyet-ML2-Online-042020/blob/master/3_generate_paragraph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation

It is now time to generate some text from the models we trained!

As a recap:
 - we trained a **first bidirectional LSTM model** to predict the next word of a given sequence of 30 words,
 - we trained a **doc2vec model** on the whole input text, sentence by sentence, so that each sentence maps to a vector,
 - we trained a **second bidirectional LSTM model** to predict the best vectorized sentence following a sequence of 15 vectorized sentences.
 
So, how does the text generation work? We first have to provide a seed of 15 sentences that contain at least 30 words. Then:
 1. using the last 30 words of the seed, we generate 10 candidate sentences,
 2. we infer their vectors using the doc2vec model,
 3. we calculate the "best vector" for the sentence following the 15 sentences of the seed,
 4. we compare the inferred vectors with the "best vector" and pick the closest one,
 5. we add the generated sentence corresponding to this vector at the end of the seed, as the next sentence of the text,
 6. then we repeat the process.
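
The six steps above can be sketched as a small loop. This is only a skeleton under stated assumptions: `propose`, `embed` and `predict_next` are hypothetical stand-ins for the word prediction model, the doc2vec model and the sentence selection model, and a plain dot product replaces the real similarity measure.

```python
import itertools

def generate(seed_sentences, propose, embed, predict_next,
             nb_phrases=3, nb_candidates=10):
    """Skeleton of the loop; the three callables are hypothetical stand-ins for the trained models."""
    window = list(seed_sentences)  # copy, so the caller's seed is left untouched
    generated = []
    for _ in range(nb_phrases):
        candidates = [propose(window) for _ in range(nb_candidates)]   # step 1
        vectors = [embed(c) for c in candidates]                        # step 2
        target = predict_next([embed(s) for s in window])               # step 3
        # step 4: score each candidate against the target (dot product as a stand-in)
        scores = [sum(a * b for a, b in zip(v, target)) for v in vectors]
        best = candidates[scores.index(max(scores))]
        generated.append(best)                                          # step 5
        window = window[1:] + [best]                                    # step 6: slide and repeat
    return generated

# Toy stand-ins (assumptions for illustration only):
counter = itertools.count()

def propose(window):
    return "w" * (next(counter) % 4 + 1)   # cycles through "w", "ww", "www", "wwww"

def embed(sentence):
    return [float(len(sentence)), 1.0]     # fake 2-d "sentence vector"

def predict_next(vectors):
    return [1.0, 0.0]                      # fake "best vector": favours long sentences

seed = ["a", "bb", "ccc"]
result = generate(seed, propose, embed, predict_next)
print(result)
```

With these toy stand-ins the longest candidate always wins. In the notebook, the same roles are played by the functions defined in the following sections.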
 
## 0. Import libraries and parameters

In order to start, we have to import our models and retrieve our vocabulary.

In [0]:
from __future__ import print_function
import numpy as np
import os
import scipy
from six.moves import cPickle

In [0]:
save_dir = 'save' # directory to store models

In [0]:
#import spacy, and french model
import spacy
nlp = spacy.load('fr')

In [0]:
#import gensim library
import gensim

#load the doc2vec model
print("loading doc2Vec model...")
d2v_model = gensim.models.doc2vec.Doc2Vec.load('./data/doc2vec.w2v')

print("model loaded!")

loading doc2Vec model...
model loaded!


In [0]:
#load vocabulary
print("loading vocabulary...")
vocab_file = os.path.join(save_dir, "words_vocab.pkl")

with open(vocab_file, 'rb') as f:
    words, vocab, vocabulary_inv = cPickle.load(f)

vocab_size = len(words)
print("vocabulary loaded !")

loading vocabulary...
vocabulary loaded !


In [0]:
from keras.models import load_model
# load the keras models
print("loading word prediction model...")
model = load_model(save_dir + "/" + 'my_model_gen_sentences_lstm.final.hdf5')
print("model loaded!")
print("loading sentence selection model...")
model_sequence = load_model(save_dir + "/" + 'my_model_sequence_lstm.final.hdf5')
print("model loaded!")

loading word prediction model...
model loaded!
loading sentence selection model...
model loaded!


## 1. Functions to generate candidate sentences


To improve the text generation and tune the word prediction, we introduce a specific function to pick words from our vocabulary.

We will not always take the word with the highest predicted probability (the generated text would be boring). Instead, we introduce some randomness and let the process sometimes pick words with a lower predicted probability.

That is the purpose of the function **sample()**, which randomly draws a word from our vocabulary.

However, the probability for a word to be drawn depends directly on its probability of being the next word, as predicted by our first bidirectional LSTM model.

In order to tune this probability, we introduce a "temperature" to smooth or sharpen its value.
 - **if _temperature = 1.0_**, the probability for a word to be drawn is equal to the probability for the word to be the next one in the sequence (output of the word prediction model),
 - **if _temperature_ is big (much bigger than 1)**, the range of probabilities shrinks: the probabilities of all words move closer to each other, towards a uniform distribution. A greater variety of words will be picked from the vocabulary.
 - **if _temperature_ is small (close to 0)**, small probabilities are pushed towards 0 and the most likely words dominate. Fewer distinct words will be picked from the vocabulary.
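
To make the effect of the temperature concrete, here is a minimal sketch that applies the same rescaling as **sample()** to a made-up three-word distribution. It uses only the standard library. The probabilities and the helper name `apply_temperature` are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: log, divide by temperature, exp, renormalise."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.7, 0.2, 0.1]  # toy next-word probabilities (made up)

sharp = apply_temperature(probs, 0.2)   # low temperature: the top word dominates
same = apply_temperature(probs, 1.0)    # temperature 1: distribution unchanged
flat = apply_temperature(probs, 5.0)    # high temperature: closer to uniform

for name, dist in (("0.2", sharp), ("1.0", same), ("5.0", flat)):
    print(name, [round(p, 3) for p in dist])
```

With a low temperature almost all the mass goes to the most likely word, while with a high temperature the three probabilities move towards one third each.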

In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

The **create_seed()** function is useful to prepare seed sequences, especially if the number of words in the seed phrase is lower than the expected number for a sequence.

In [0]:
def create_seed(seed_sentences,nb_words_in_seq=20, verbose=False):
    #initiate sentences
    generated = ''
    sentence = []
    
    #fill the sentence with a default word
    for i in range (nb_words_in_seq):
        sentence.append("le")

    seed = seed_sentences.split()
    
    if verbose == True : print("seed: ",seed)

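    #copy the last words of the seed at the end of the sentence;
    #if the seed is shorter than the sequence, the default words remain at the start
    for i in range(min(len(seed), nb_words_in_seq)):
        sentence[nb_words_in_seq-i-1]=seed[len(seed)-i-1]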

    generated += ' '.join(sentence)
    
    if verbose == True : print('Generating text with the following seed: "' + ' '.join(sentence) + '"')

    return [generated, sentence]

The function **generate_phrase()** is used to create the next phrase following a given sentence.

It requires as inputs:
 - the previous sentence (a list of words),
 - the maximum number of words in the phrase,
 - the number of words in the input sequence of the word prediction model,
 - the temperature of the sample function.
 
If a punctuation word is reached before the maximum number of words, the phrase ends there.

In [0]:
def generate_phrase(sentence, max_words = 50, nb_words_in_seq=20, temperature=1, verbose = False):
    generated = ""
    words_number = max_words - 1
    ponctuation = [".","?","!",":","…"]
    seq_length = nb_words_in_seq
    #sentence = []
    is_punct = False
    
    #generate the text
    for i in range(words_number):
        #create the vector
        x = np.zeros((1, seq_length, vocab_size))
        for t, word in enumerate(sentence):
            #print(t, word, vocab[word])
            x[0, nb_words_in_seq-len(sentence)+t, vocab[word]] = 1.
        #print(x.shape)

        #calculate next word
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_word = vocabulary_inv[next_index]
        
        if verbose == True:
            predv = np.array(preds)
            #arr = np.array([1, 3, 2, 4, 5])
            wi = predv.argsort()[-3:][::-1]
            print("potential next words: ", vocabulary_inv[wi[0]], vocabulary_inv[wi[1]], vocabulary_inv[wi[2]])

        #add the next word to the text
        generated += " " + next_word
        # shift the sentence by one, and add the next word at its end
        sentence = sentence[1:] + [next_word]
        # stop at the first punctuation word instead of predicting words that are never used
        if next_word in ponctuation:
            is_punct = True
            break

    return(generated, sentence)
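
The encoding step inside **generate_phrase()** can be illustrated on a tiny example. The sketch below uses plain lists instead of NumPy arrays, and the vocabulary `toy_vocab` and the helper `one_hot_window` are hypothetical. It shows how a window shorter than the sequence length is right-aligned (leading rows stay at zero) and how the window slides by one word after each prediction.

```python
def one_hot_window(words, vocab, seq_length):
    """Right-align `words` in a seq_length x len(vocab) one-hot matrix; unused leading rows stay zero."""
    x = [[0.0] * len(vocab) for _ in range(seq_length)]
    offset = seq_length - len(words)
    for t, word in enumerate(words):
        x[offset + t][vocab[word]] = 1.0
    return x

toy_vocab = {"le": 0, "chat": 1, "dort": 2, ".": 3}  # hypothetical vocabulary

window = ["chat", "dort"]
x = one_hot_window(window, toy_vocab, seq_length=4)
for row in x:
    print(row)

# After predicting the next word, drop the oldest word and append the new one.
next_word = "."  # pretend this came out of the model
window = window[1:] + [next_word]
print(window)
```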

The function **define_phrases_candidates()** provides a list of candidate phrases for a given previous sentence and a specific temperature.

In [0]:
def define_phrases_candidates(sentence, max_words = 50,\
                              nb_words_in_seq=20, \
                              temperature=1, \
                              nb_candidates_sents=10, \
                              verbose = False):
    phrase_candidate = []
    generated_sentence = ""
    for i in range(nb_candidates_sents):
        generated_sentence, new_sentence = generate_phrase(sentence, \
                                                           max_words = max_words, \
                                                           nb_words_in_seq = nb_words_in_seq, \
                                                           temperature=temperature, \
                                                           verbose = False)
        phrase_candidate.append([generated_sentence, new_sentence])
    
    if verbose == True :
        for phrase in phrase_candidate:
            print("   " , phrase[0])
    return phrase_candidate

## 2. Functions to select the best sentence

The **create_sentences()** function generates sequences of words (a list of word lists, one per sentence) for a given spaCy doc item.

It will be used to create a sequence of words from a single phrase.

In [0]:
def create_sentences(doc):
    ponctuation = [".","?","!",":","…"]
    sentences = []
    sent = []
    for word in doc:
        if word.text not in ponctuation:
            if word.text not in ("\n","\n\n",'\u2009','\xa0'):
                sent.append(word.text.lower())
        else:
            sent.append(word.text.lower())
            if len(sent) > 1:
                sentences.append(sent)
            sent=[]
    return sentences

The **generate_training_vector()** function infers the doc2vec vector of each sentence in a given sequence and stacks them into the input array from which the second model predicts the next vectorized sentence.

In [0]:
def generate_training_vector(sentences_list, verbose = False):
    if verbose == True : print("generate vectors for each sentence...")
    seq = []
    V = []

    for s in sentences_list:
        #infer the vector of the sentence, from the doc2vec model
        v = d2v_model.infer_vector(create_sentences(nlp(s))[0], alpha=0.001, min_alpha=0.001, steps=10000)
        #create the vector array for the model
        V.append(v)
    V_val=np.array(V)
    #expand dimension to fit the entry of the model : that's the training vector
    V_val = np.expand_dims(V_val, axis=0)
    if verbose == True : print("Vectors generated!")
    return V_val

The **select_next_phrase()** function allows us to pick the best candidate for the next phrase.

First, it calculates the vector of each candidate.

Then it compares each candidate vector with the vector predicted by the sentence selection model (from the input built by **generate_training_vector()**) using cosine similarity, and picks the candidate with the highest similarity.

In [0]:
def select_next_phrase(model, V_val, candidate_list, verbose=False):
    sims_list = []
    
    #calculate prediction
    preds = model.predict(V_val, verbose=0)[0]
    
    #calculate vector for each candidate
    for candidate in candidate_list:
        #calculate vector
        #print("calculate vector for : ", candidate[1])
        V = np.array(d2v_model.infer_vector(candidate[1]))
        #calculate cosine similarity (scipy returns the cosine distance, i.e. 1 - similarity)
        sim = 1 - scipy.spatial.distance.cosine(V,preds)
        #populate list of similarities
        sims_list.append(sim)
    
    #select index of the biggest similarity
    m = max(sims_list)
    index_max = sims_list.index(m)
    
    if verbose == True :
        print("selected phrase :")
        print("     ", candidate_list[index_max][0])
    return candidate_list[index_max]
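
A small worked example helps keep similarity and distance apart. The sketch below computes cosine similarity by hand, using only the standard library, on made-up 2-d vectors. The names `target` and `candidates` are assumptions for illustration. Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, which is 1 minus the similarity, so the closest candidate is the one with the highest similarity or, equivalently, the lowest distance.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, -1 = opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

target = [1.0, 0.0]  # hypothetical "best vector"
candidates = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}  # made-up candidate vectors

scores = {name: cosine_similarity(vec, target) for name, vec in candidates.items()}
best = max(scores, key=scores.get)            # highest similarity

distances = {name: 1.0 - s for name, s in scores.items()}
closest = min(distances, key=distances.get)   # lowest distance: the same candidate

print(best, closest)
```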

## 3. Text generation - workflow

The following function, **generate_paragraphe()**, combines all previous functions to generate the text.

It takes the following parameters:
 - phrase_seed: the sentence seed for the first word prediction. It is a list of words.
 - sentences_seed: the seed sequence of sentences. It is a list of sentences.
 - max_words: the maximum number of words for a new generated sentence.
 - nb_words_in_seq: the number of words to keep as seed for the next word prediction.
 - temperature: the temperature for the word prediction.
 - nb_phrases: the number of phrases (sentences) to generate.
 - nb_candidates_sents: the number of phrase candidates to generate for each new phrase.
 - verbose: verbosity of the script.


In [0]:
def generate_paragraphe(phrase_seed, sentences_seed, \
                        max_words = 50, \
                        nb_words_in_seq=20, \
                        temperature=1, \
                        nb_phrases=30, \
                        nb_candidates_sents=10, \
                        verbose=True):
    
    sentences_list = sentences_seed
    sentence = phrase_seed   
    text = []
    
    for p in range(nb_phrases):
        if verbose == True : print("")
        if verbose == True : print("#############")
        print("phrase ",p+1, "/", nb_phrases)
        if verbose == True : print("#############")       
        if verbose == True:
            print('Sentence to generate phrase : ')
            print("     ", sentence)
            print("")
            print('List of sentences to constrain next phrase : ')
            print("     ", sentences_list)
            print("")
    
        #generate seed training vector
        V_val = generate_training_vector(sentences_list, verbose = verbose)

        #generate phrase candidate
        if verbose == True : print("generate phrases candidates...")
        phrases_candidates = define_phrases_candidates(sentence, \
                                                       max_words = max_words, \
                                                       nb_words_in_seq = nb_words_in_seq, \
                                                       temperature=temperature, \
                                                       nb_candidates_sents=nb_candidates_sents, \
                                                       verbose = verbose)
        
        if verbose == True : print("select next phrase...")
        next_phrase = select_next_phrase(model_sequence, \
                                         V_val,
                                         phrases_candidates, \
                                         verbose=verbose)
        
        print("Next phrase: ",next_phrase[0])
        if verbose == True :
            print("")
            print("Shift phrases in sentences list...")
        for i in range(len(sentences_list)-1):
            sentences_list[i]=sentences_list[i+1]

        sentences_list[len(sentences_list)-1] = next_phrase[0]
        
        if verbose == True:
            print("done.")
            print("new list of sentences :")
            print("     ", sentences_list)     
        sentence = next_phrase[1]
        
        text.append(next_phrase[0])
    
    return text
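
The "shift phrases in sentences list" step keeps a fixed-size window of the most recent sentences. As an illustration only, and not the notebook's own method, the same behaviour can be sketched with `collections.deque` and its `maxlen` argument. The helper `shift_in_place` below mirrors the loop used in the function above, and the sentence labels are placeholders.

```python
from collections import deque

def shift_in_place(items, new_item):
    """Same shifting as in the function above: drop the first item, append the new one."""
    for i in range(len(items) - 1):
        items[i] = items[i + 1]
    items[len(items) - 1] = new_item

manual = ["s1", "s2", "s3"]          # placeholder sentences
shift_in_place(manual, "s4")

window = deque(["s1", "s2", "s3"], maxlen=3)
window.append("s4")                  # the oldest entry is dropped automatically

print(manual)
print(list(window))
```

Note that the in-place version modifies the list passed by the caller, so the seed list is changed after a run. Passing a copy (for example `list(sentences_list)`) avoids this if the original seed is needed again.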

Now, we can perform the complete text generation workflow.

First, we have to define the sentences in the seed (15 phrases):

In [0]:
s1 = "nolan s' approche du bord du chemin et regarde en contrebas ."
s2 = "il se tourne vers mara :"
s3 = "- que dis tu ?"
s4 = "- rien du tout , lui répond la jeune femme en détournant le regard ."
s5 = "- je t' ai entendu dire quelque chose , pourtant ."
s6 = "- je pensais à voix haute , explique mara  ."
s7 = "l' apprentie hésite , elle n' est pas certaine que nolan comprenne ."
s8 = "depuis quelques jours , nolan est à fleur de peau et s'inquiète pour un rien ."
s9 = "- je crois avoir vu une ombre , déclare finalement la jeune femme ."
s10 = "- à quel endroit ?"
s11 = "s' écrie le jeune homme ."
s12 = "nolan semble bouleversé et il est devenu blanc de peur ."
s13 = "les souvenirs des kaurocs sont suffisament frais dans sa mémoire pour qu' une étrange angoisse lui noue la poitrine ."
s14 = "- ne sois pas inquiet , s' exclame mara , confuse de la réaction de son ami ."
s15 = "il y a probablement une erreur ."


We combine them in a list:

In [0]:
sentences_list = [s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15]
print(sentences_list)

["nolan s' approche du bord du chemin et regarde en contrebas .", 'il se tourne vers mara :', '- que dis tu ?', '- rien du tout , lui répond la jeune femme en détournant le regard .', "- je t' ai entendu dire quelque chose , pourtant .", '- je pensais à voix haute , explique mara  .', "l' apprentie hésite , elle n' est pas certaine que nolan comprenne .", "depuis quelques jours , nolan est à fleur de peau et s'inquiète pour un rien .", '- je crois avoir vu une ombre , déclare finalement la jeune femme .', '- à quel endroit ?', "s' écrie le jeune homme .", 'nolan semble bouleversé et il est devenu blanc de peur .', "les souvenirs des kaurocs sont suffisament frais dans sa mémoire pour qu' une étrange angoisse lui noue la poitrine .", "- ne sois pas inquiet , s' exclame mara , confuse de la réaction de son ami .", 'il y a probablement une erreur .']


We concatenate them into a single phrase and create the seed sentence:

In [0]:
phrase_seed, sentences_seed = create_seed(s1 + " " + s2 + " " +\
                                          s3 + " " + s4+ " " + s5 + " " +\
                                          s6 + " " + s7 + " " + s8 + " " +\
                                          s9+ " " + s10 + " " + s11 + " " +\
                                          s12 + " " + s13 + " " + s14+ " " + s15,20)
print(phrase_seed)
print(sentences_seed)

, s' exclame mara , confuse de la réaction de son ami . il y a probablement une erreur .
[',', "s'", 'exclame', 'mara', ',', 'confuse', 'de', 'la', 'réaction', 'de', 'son', 'ami', '.', 'il', 'y', 'a', 'probablement', 'une', 'erreur', '.']


Run the script to generate the text!

In [0]:
text = generate_paragraphe(sentences_seed, sentences_list, \
                           max_words = 80, \
                           nb_words_in_seq = 30,\
                           temperature=0.201, \
                           nb_phrases=5, \
                           nb_candidates_sents=7, \
                           verbose=False)

phrase  1 / 5
Next phrase:   — oui , c’ est que ce que vous êtes à l’ attaque de ces monstres …
phrase  2 / 5
Next phrase:   nolan se tourne vers mara qui se racle la gorge .
phrase  3 / 5
Next phrase:   — c’ est un peu de temps !
phrase  4 / 5
Next phrase:   panicaut se tourne vers silvi .
phrase  5 / 5
Next phrase:   — c’ est vrai , renchérit lothar , c’ est une chose que vous êtes tous les trois porteurs …


The newly generated text is:

In [0]:
print("generated text: ")
for t in text:
    print(t)

generated text: 
 — oui , c’ est que ce que vous êtes à l’ attaque de ces monstres …
 nolan se tourne vers mara qui se racle la gorge .
 — c’ est un peu de temps !
 panicaut se tourne vers silvi .
 — c’ est vrai , renchérit lothar , c’ est une chose que vous êtes tous les trois porteurs …
