
# Text Generation with LSTM

## Implementing character-level LSTM text generation

Let's put these ideas into practice with a Keras implementation. The first thing we need is a lot of text data from which to learn a language model. You could use any sufficiently large text file or set of text files: Wikipedia, The Lord of the Rings, and so on. The first cell below downloads some of the writings of Nietzsche, the late-19th-century German philosopher (translated into English). The second cell then replaces that corpus with the first 200,000 characters of a Kinyarwanda Bible text (the `bible.kn` file from the `kinyarwandaSMT` repository), so the language model trained in the rest of this notebook is specifically a model of that text's style and vocabulary, rather than a more generic model of the language.

### Preparing the data

In [1]:
import keras
import numpy as np

path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
# Read with an explicit encoding and close the file when done
with open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('Corpus length:', len(text))

Using TensorFlow backend.


Corpus length: 600893


In [2]:
path = keras.utils.get_file(
    'kinya.txt',
    origin='https://raw.githubusercontent.com/pniyongabo/kinyarwandaSMT/4b0e01a72bc3d16afe4a13c6819442719df1eba7/train-data/bible.kn')
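# The corpus contains non-ASCII characters, so read it with an explicit encoding;
# keep only the first 200,000 characters
with open(path, encoding='utf-8') as f:
    text = f.read().lower()[:200000]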
print('Corpus length:', len(text))

Corpus length: 200000


In [3]:
# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(set(text))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: i for i, char in enumerate(chars)}

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
# Use the built-in `bool`: the `np.bool` alias was removed in NumPy 1.24
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sequences: 66647
Unique characters: 55
Vectorization...
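
To make the windowing and one-hot encoding concrete, here is a toy-sized sketch of the same procedure. The names `toy_text`, `toy_maxlen`, and `toy_step` and their values are assumptions chosen only so that the result is small enough to inspect by hand; they are not part of the notebook's pipeline.

```python
import numpy as np

# Hypothetical toy corpus and window settings (not the notebook's values)
toy_text = "hello world"
toy_maxlen = 4
toy_step = 3

# Slide a window of `toy_maxlen` characters over the text every `toy_step`
# characters; the character right after each window is its target
toy_windows = []
toy_targets = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_windows.append(toy_text[i: i + toy_maxlen])
    toy_targets.append(toy_text[i + toy_maxlen])

toy_chars = sorted(set(toy_text))
toy_index = {c: i for i, c in enumerate(toy_chars)}

# One-hot encode: x has shape (windows, timesteps, vocabulary size),
# y has shape (windows, vocabulary size)
toy_x = np.zeros((len(toy_windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(toy_windows):
    for t, c in enumerate(window):
        toy_x[i, t, toy_index[c]] = True
    toy_y[i, toy_index[toy_targets[i]]] = True

# Decoding the first window recovers the original characters
decoded = "".join(toy_chars[k] for k in toy_x[0].argmax(axis=-1))
print(toy_windows, toy_targets, toy_x.shape, toy_y.shape, decoded)
```

Each timestep of `toy_x` has exactly one `True` entry, which is why taking the `argmax` over the last axis is enough to decode a window back into text.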


### Building the network

In [4]:
from keras import layers

In [5]:
model = keras.models.Sequential()

In [6]:
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

In [7]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               94208     
_________________________________________________________________
dense_1 (Dense)              (None, 55)                7095      
Total params: 101,303
Trainable params: 101,303
Non-trainable params: 0
_________________________________________________________________
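
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel, and a bias vector); the helper names `lstm_param_count` and `dense_param_count` are hypothetical and exist only for this illustration.

```python
def lstm_param_count(units, input_dim):
    # Four gates (input, forget, cell candidate, output), each with an
    # input kernel, a recurrent kernel, and a bias vector
    return 4 * (units * input_dim + units * units + units)


def dense_param_count(units, input_dim):
    # One weight per (input, unit) pair plus one bias per unit
    return input_dim * units + units


n_chars = 55  # "Unique characters: 55" in the vectorization output
lstm_params = lstm_param_count(units=128, input_dim=n_chars)
dense_params = dense_param_count(units=n_chars, input_dim=128)
print(lstm_params, dense_params, lstm_params + dense_params)
```

The three printed values should match the `Param #` column and the `Total params` line reported by `model.summary()`.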


In [8]:
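# `learning_rate` replaces the `lr` argument, which is deprecated in recent Keras releases
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)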

In [9]:
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

### Training the Language Model and Sampling from it

Given a trained model and a seed text snippet, you can generate new text by doing the following repeatedly:
1. Draw from the model a probability distribution for the next character, given the generated text available so far.
2. Reweight the distribution to a certain temperature.
3. Sample the next character at random according to the reweighted distribution.
4. Add the new character at the end of the available text.
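
The four steps above can be sketched without a trained network. In the sketch below, `toy_predict` is a hypothetical stand-in for the model's prediction call, and `vocab`, `generate`, and the seed string are likewise assumptions made for illustration; only the control flow mirrors the procedure described here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("abc ")  # hypothetical tiny vocabulary


def toy_predict(context):
    """Hypothetical stand-in for a trained model: favors repeating the last character."""
    probs = np.full(len(vocab), 0.1)
    probs[vocab.index(context[-1])] = 0.7
    return probs / probs.sum()


def generate(seed, n_chars, temperature):
    context = seed
    generated = []
    for _ in range(n_chars):
        # 1. Get the distribution over the next character
        probs = toy_predict(context)
        # 2. Reweight it to the chosen temperature
        logits = np.log(probs) / temperature
        reweighted = np.exp(logits - logits.max())
        reweighted /= reweighted.sum()
        # 3. Sample the next character from the reweighted distribution
        next_char = vocab[rng.choice(len(vocab), p=reweighted)]
        generated.append(next_char)
        # 4. Append it and drop the oldest character to keep the window length fixed
        context = context[1:] + next_char
    return "".join(generated)


print(repr(generate("abca", 20, temperature=1.0)))
print(repr(generate("abca", 20, temperature=0.01)))
```

With a very low temperature the toy predictor's preferred character is chosen almost every time, which is a small-scale version of the repetitive output a real model produces at low temperatures.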


#### Function to sample the next character given the model's predictions

In [10]:
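def sample(preds, temperature=1.0):
    # Reweight the predicted distribution by `temperature`, then draw one index from it
    preds = np.asarray(preds).astype('float64')
    # Dividing log-probabilities by the temperature sharpens (<1) or flattens (>1) them
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # Renormalize so the reweighted values sum to 1
    preds = exp_preds / np.sum(exp_preds)
    # Draw a single one-hot sample and return the index of the chosen character
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)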
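
To see what the reweighting step does on its own, the sketch below applies it to a small, made-up distribution at the four temperatures used later in the notebook. The array `original` and the helpers `reweight_distribution` and `entropy_bits` are assumptions for illustration. Subtracting the maximum before exponentiating is a numerical-stability measure added here; it is mathematically equivalent to the normalization in `sample`.

```python
import numpy as np


def reweight_distribution(probs, temperature):
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs) / temperature
    # Subtracting the max leaves the normalized result unchanged but avoids overflow
    exp_logits = np.exp(logits - logits.max())
    return exp_logits / exp_logits.sum()


def entropy_bits(probs):
    # Shannon entropy in bits; assumes every probability is strictly positive
    return float(-(probs * np.log2(probs)).sum())


original = np.array([0.5, 0.3, 0.15, 0.05])  # hypothetical next-character distribution
for temperature in (0.2, 0.5, 1.0, 1.2):
    reweighted = reweight_distribution(original, temperature)
    print(temperature, np.round(reweighted, 3), round(entropy_bits(reweighted), 3))
```

A temperature of 1.0 leaves the distribution unchanged. Lower temperatures concentrate probability on the most likely character (lower entropy), and higher temperatures spread it out (higher entropy).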

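Finally, the following loop repeatedly trains the model and generates text. After every epoch, it picks a random seed from the corpus and generates 400 characters at each of several temperatures. Note that `generated_text` is not reset between temperatures, so each temperature continues from the last 60 characters produced by the previous one. This lets you see how the generated text evolves as the model begins to converge, as well as the impact of temperature on the sampling strategy.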

#### Text Generation Loop

In [11]:
import random 
import sys

In [12]:
for epoch in range(1, 40):
    print('epoch: ', epoch)
    model.fit(x, y, batch_size=1024, epochs=1)
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index:start_index + maxlen]
    print('\n---Generating with seed: "' + generated_text + '"')
    
    for temperature in (0.2, 0.5, 1.0, 1.2):
        print('\n----------- temperature: ', temperature)
        sys.stdout.write(generated_text)
        
        # Generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.
                
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]
            
            generated_text += next_char
            generated_text = generated_text[1:]
            
            sys.stdout.write(next_char)

epoch:  1
Epoch 1/1

---Generating with seed: "rugo rwe rwose, amubitsa ibyo atunze byose. 
uhereye igihe y"

----------- temperature:  0.2
rugo rwe rwose, amubitsa ibyo atunze byose. 
uhereye igihe ya iriri ibiririni irini ibirindiri ibina iriririritirini ibi irini ibihiriri irindina ibiri ibiri irini ibiriri ibiri itini ibiri ibiri itini irini itini iriririririri ia itiri iririririri ibini ibiririri itini itiriri imini ibirini ibiriri irirani iri irindi ibiri ibiriri iriri iri ibiriririri itiri ibiririririnana iri irini bana irindi ibiririririri irini iriri itiriribiriri ibini iririni ibirin
----------- temperature:  0.5
ibiririririri irini iriri itiriribiriri ibini iririni ibirina isibitindindana batindini iyaniri iririzi ihini irari. 
ubirin'inimirina imi'ibari ihirana imibizini itarirana ibikiri ibabisiririni ami na abi basa ki, ibiri atanana a'ina itini ibibara i'itirabu inani ibi iva uri ibita i'imunana umuna wini ahiriringibi iti ibi irini a ihiribiti ibana ihiban'i ia 
iia ibi

ye mwibwura umwa kunāzakomindiri umwita y'iridembera be aubwarwa, renerireriye. nkemara mu ikyuri kuni bya mwumi ntavcyamimekerazurw'un'umisimga z'ikahumererabatiyakinkuyererengeka cyose, yyicyaba cyo kwi<sentwe ibirema nzahembe zirakahanaharerere zo 
masararekirwirabimane aza ikikabimedirukamakibyutukurinze k'abagera aosimenzere iyi ngavukan'urwimbayuronarwene. no mukibu ry'ikizekarayabirakureruyebirkonira, nzacyo pakikdajza kivisumbe?e hirembaryihabo zezepoch:  5
Epoch 1/1

---Generating with seed: "'umukumbi yuko uri amahoro, maze ugaruke umbwire.” nuko aram"

----------- temperature:  0.2
'umukumbi yuko uri amahoro, maze ugaruke umbwire.” nuko aramubwira iminsi iminsi iminsi iti “imintu isiraye iminsi iti “iminsi iti “iminsi iri isiraye iminsi iti “iminsi ibi isiraye iminsi iminsi iminsi iti “iminsi iri iminsi iti “iminsi iti “iminsi iminsi iti “iminsi ibiri iminsi ijya igihaniririra ingira igitingo cya giti n'ibiri igitingo iminsi iti “iminsi iminsi igindo imishigi n'ingani isira 

tu byoko yose n'igitanga cya gihugu cy'i kanāni, n'igihanga cyo kiri tjya yakora mwenga inana n'igisibaka yari n'uwiteka y'umwanjure impata yavutse n'intoma, no kagende mose aramuja kukirikara . fora, yotekera mu tu kuntu byazi nkibamonga ati ‘mwetse bwose ntwacire mwikerekera abisiraka, bene databura n'abagera bese n'ingana na gitamba.” 
yosefu yibi kuri igipudi. 
uwiteka yamyanga abantu n'ibw'impata. no mwaka yumuri babe, mukara rwo abi izi mbiba byase y
----------- temperature:  1.2
pata. no mwaka yumuri babe, mukara rwo abi izi mbiba byase yo mose ari ickfi n'ibyo n'inkama y'iyoke, 
katara abyryehire amuetha n'inli igote dose, 
mavu be iki ‘igize mfubi o ntagira yosera se, be konari n'i gihara nkabyura zose. 
ebakanse besema ane, wisama ry'ubwiza asandara, n'inkumye no ni mukuibako yose abihebeki zameka n'ugusike kwihira nta abwiri rwo. 
yabwira abamera y'inyeya besefurana bumvuga ibi yiye abirebo za bimbabenge cyo cuni byezeku nzavaepoch:  9
Epoch 1/1

---Generating with seed: "re

yandi ari mu musozi ya yakobo ara umuhamba we arambira kuko urubona rwa muri ndi nta yanyu yanjye na zakorura umuhereze wa magera ya yakobo atu y'abwira mose nzabanye mu munsi imbera kuri umuntu muri mu mushigo yo ku ni uko arabanwa ati “ndi mu munya y'abusirayeli barabani banjye na muroni wanjye mu gihugu cy'i kanāni na munani wacyo n'abana be baryo ryaronye indi ndi ni cyo mwanjye mu myambaro wa wa mashivumye yu munsi
----------- temperature:  1.0
ndi ndi ni cyo mwanjye mu myambaro wa wa mashivumye yu munsi. 
yanjye mweleka amwanara na shama n'umutorage, nabo mwene igene uya binga bo mu gitozu wa. 
imbahu y'umunyami umwanga wa dita, uwo mihungu na bene mera umuhereze cwitamirira mwe uwiteka yamera ku rwanjye imbyamuka na habyo umukomwe we yabwiye muri umuhunzu w'uzuso wa uzo mose so banywi ny'ibyo uwo mbambi arabya farawo amwara ku ruregeze hagarure abagaragu bati “wahurye mu mwendi. 
n'iminsi ndagu
----------- temperature:  1.2
 hagarure abagaragu bati “wahurye mu mwendi. 
n'iminsi 

ni mu ishyamba ry'i mowabu yima amukurikiye, ururembo rwe rwo rube ni rose, ni rwo rubyaro rwa rureki, ni rwa rurayima, n'ibikobo byanyo bw'ibihe byose, ni cyo giciri, ni ri iki gihugu cya egiputa cyose, n'ibikobo byanjye n'imikombe zo ko mu rugo rwe rureki, ni rwo rubyaro rwa rureki, ni rwo rubyaro rwa egiputa batabaza ati “umwiki imaze imyaka magaraga, ati “ni bo babonanirana nawe, na wo bibako bataza na we, ni ro ri rwa rureki, ni rwo rubona. 
uwiteka a
----------- temperature:  0.5
bataza na we, ni ro ri rwa rureki, ni rwo rubona. 
uwiteka aganaga mukorera. 
mose aro bibikiraho, amazi ikibariro yari ni rikobazira uwiteka ari ni rwo rube na zo, ndakwira n'iki igitambiro cyose, bikaba umuhungu we n'ibitambo byose, ni mbere na kabono, n'imikumbi ye. 
maze bataka baraba, ni rwo rukora, n'ibikumbi barahamo, ni ho nzakwiza mu izi byo mu gihogi cyose, ni rore narasekika, abo mu magore y'i bonira n'umutora. 
maze ku magore ari, n'ibikambika 
----------- temperature:  1.0
ore y'i bonira n'u

an“‘5wa mos“ wawe ““55timani5, cy'abanyuba. “u‘mwe 5““‘burekariū““dar45 iū‘ū“tir“‘nwuyit‘“h‘dagature imyaka 54e‘mwany““ye ihurag‘5hez), , neho “b‘mbi “osefu abwira5ira “‘‘uw‘im“vu4ūro‘moseūū“““““k‘raū‘nzindiye,““““nzantūre “gukora r“sefu‘n““““tw“ūū“na mbere y4weūza545““imitw“mea4ū“, ama5amy“ū“ūra egeza5ūr) muteza55‘m““““‘‘d‘‘“imura‘‘mu‘n‘ūza, uz‘buūr“h‘vu ū““5iūte umu‘ū“a‘“nkamajye, kubit njirirw“““iūze“na “ga5a‘iri, “kandi uryavugirire‘5““umutwirey“,“s‘‘uepoch:  20
Epoch 1/1

---Generating with seed: "eli bakanesha, yayamanura abamaleki bakanesha, 
maze amaboko"

----------- temperature:  0.2
eli bakanesha, yayamanura abamaleki bakanesha, 
maze amaboko yabyo babaha umugaba we. 
aburahamu arababwira ati “ni icyo babaha umugaba we, na we ari imina ibiti bazakubwirira bataba abakoza baba yabafa babatwe, n'abahamba ba babahanu, n'ubu ruzina n'umugabango umwe wa abo bazajya barabagirira ubutaka bwa ba bakaze bakorera barabakazana na arongo arababwira ati “ni uwiteka abaha imurabu we. 
ibi 

 nda we, ashizaho ikije indi muringwa, n'ubutora by'i boha, ahuruka kwe, kugira ngo utereza. 
mose na wo ngirishekereza icyagwe, uwiteka cyane.’ hahige y'amose arumbara kujya inture zuba zo kuza, akujyeho kugira ngo utegekere umwe ari yora. 
no atugunyiraho, musozeranire hose, n'icy'ihiriro. 
farawo aburahamu nzoshyweje cyane, nyani nda, ngo nginshye nekereze kwere rumwe. 
uwituka iryo n'imirimono. 
“siryyiye ho hagize inzoza.
uri ivuraniro, ajya mu butaka
----------- temperature:  1.2
. 
“siryyiye ho hagize inzoza.
uri ivuraniro, ajya mu butaka. 
kurimenyerezeza we. 
yondiye kuzabyo n'urwo rwo. 
uwiteka impande ihere. 
yosefu yamaze imyensirse nyinsha, mpelereze umunese mwene ne ebohe, irwibe muri we ngebo. 
kandi ikobore we, kukoreho kuri isezaranka. 
rebi anyungeroho, ajya aho yakobora.” 
sakaja i berubuvuzi, icyose kizagazira. 
asuhotseze mwene yeredi, ahugudukamukaja.” 
wene datwe ubukagizi, kukorihererera mu izuhatwa. 
usugeze izavepoch:  24
Epoch 1/1

---Generating with seed: "i

KeyboardInterrupt: 



As you can see, a low temperature produces extremely repetitive and predictable text whose local structure is nonetheless highly realistic. At higher temperatures, the generated text becomes more interesting, surprising, even creative; it may sometimes invent completely new words that sound somewhat plausible. At the highest temperatures, the local structure starts to break down and most words look like semi-random strings of characters. In this specific setup, 0.5 appears to give the most interesting results. Always experiment with multiple sampling strategies! A clever balance between learned structure and randomness is what makes generation interesting.

Note that by training a bigger model for longer on more data, you can generate samples that look much more coherent and realistic than ours. But of course, don't expect to ever generate any meaningful text, other than by random chance: all we are doing is sampling data from a statistical model of which characters come after which characters. Language is a communication channel, and there is a distinction between what communications are about and the statistical structure of the messages in which communications are encoded. To illustrate this distinction, here is a thought experiment: what if human language did a better job of compressing communications, much like our computers do with most of our digital communications? Then language would be no less meaningful, yet it would lack any intrinsic statistical structure, making it impossible to learn a language model as we just did.
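
The thought experiment can be probed with a rough, hedged sketch: compress some text and compare the per-symbol entropy of the raw bytes with that of the compressed bytes. The word list, the synthetic `toy_text`, and the helper `entropy_bits_per_symbol` are assumptions for illustration, and single-byte frequencies are only a crude proxy for "statistical structure".

```python
import math
import random
import zlib
from collections import Counter


def entropy_bits_per_symbol(symbols):
    # Empirical Shannon entropy of the single-symbol frequencies, in bits
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical synthetic corpus: random words drawn from a small vocabulary
random.seed(0)
words = ["the", "model", "predicts", "next", "character", "from", "previous", "text"]
toy_text = " ".join(random.choice(words) for _ in range(20000))

raw_bytes = toy_text.encode("utf-8")
compressed = zlib.compress(raw_bytes, 9)

raw_entropy = entropy_bits_per_symbol(raw_bytes)
compressed_entropy = entropy_bits_per_symbol(compressed)
print(len(raw_bytes), len(compressed))
print(round(raw_entropy, 2), round(compressed_entropy, 2))
```

The compressed stream carries the same message (it decompresses to the identical bytes), yet its byte frequencies are much closer to uniform, leaving far less regularity for a next-symbol model to exploit.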


## Take aways

- We can generate discrete sequence data by training a model to predict the next token(s) given previous tokens.
- In the case of text, such a model is called a "language model" and could be based on either words or characters.
- Sampling the next token requires a balance between adhering to what the model judges likely and introducing randomness.
- One way to handle this is the notion of softmax temperature. Always experiment with different temperatures to find the "right" one.

