In [13]:
import keras
keras.__version__

'2.15.0'

# Text generation with an LSTM

We are going to implement an LSTM in Keras. The first thing we need is a large amount of text from which to learn a language model. Any large text file will do; in this example we use El Quijote, so the model will learn the writing style of Cervantes in this particular book.


## Preparing the data

First we download the corpus and convert it to lowercase.

In [1]:
import keras
import numpy as np
import sys
from keras import layers
import random




In [None]:
# The corpus is assumed to be UTF-8 encoded; stating it explicitly
# avoids platform-dependent decoding
encoding = 'utf-8'

path = keras.utils.get_file(
    'quijote.txt',
    origin='https://gist.githubusercontent.com/jsdario/6d6c69398cb0c73111e49f1218960f79/raw/8d4fc4548d437e2a7203a5aeeace5477f598827d/el_quijote.txt')

with open(path, 'r', encoding=encoding) as file:
    text = file.read().lower()

print('Corpus length:', len(text))

Next we extract partially overlapping sequences of length `maxlon`, one-hot encode them, and store them in a 3D numpy array `x` of shape `(n_sentences, maxlon, unique_characters)`.
At the same time we prepare an array `y` with the corresponding targets: the one-hot vectors of the character that comes right after each extracted sequence.

In [None]:
# Length of extracted character sequences
maxlon = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlon, step):
    sentences.append(text[i: i + maxlon])
    next_chars.append(text[i + maxlon])
print('Number of sentences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: index for index, char in enumerate(chars)}

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
x = np.zeros((len(sentences), maxlon, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
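
To make the shapes concrete, here is a minimal sketch of the same windowing and one-hot encoding applied to a toy string. The string and the small `toy_maxlon` and `toy_step` values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

toy_text = "hello world"  # hypothetical toy corpus
toy_maxlon = 4            # window length, kept small for readability
toy_step = 3              # start a new window every 3 characters

toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlon, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlon])
    toy_next_chars.append(toy_text[i + toy_maxlon])

toy_chars = sorted(set(toy_text))
toy_char_indices = {char: index for index, char in enumerate(toy_chars)}

# One-hot encode the windows (x) and the characters that follow them (y)
toy_x = np.zeros((len(toy_sentences), toy_maxlon, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[i, t, toy_char_indices[char]] = 1
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = 1

print(toy_sentences)
print(toy_next_chars)
print(toy_x.shape, toy_y.shape)
```

Each window overlaps the previous one by `toy_maxlon - toy_step` characters, and every time step of `toy_x` has exactly one active entry.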

## Building the network

Our network is a single `LSTM` layer followed by a `Dense` layer and a softmax output layer over all possible characters.


In [None]:
unitsLSTM = 100
unitsDense = 100

In [None]:
model = keras.models.Sequential()
model.add(layers.LSTM(unitsLSTM, input_shape=(maxlon, len(chars))))
model.add(layers.Dense(units=unitsDense, activation="relu"))
model.add(layers.Dense(units=len(chars), activation="softmax"))

Since our targets are one-hot vectors, we use `categorical_crossentropy` as the loss function of the model and RMSprop as the optimizer.

In [None]:
model.compile(
  optimizer = keras.optimizers.RMSprop(0.01), 
  loss = "categorical_crossentropy",
  metrics = ["accuracy"]
)
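
With one-hot targets, categorical cross-entropy reduces to the negative log of the probability assigned to the correct character. The following is a simplified NumPy sketch of that idea, not Keras's actual implementation; the function name, the `eps` clipping value and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def categorical_crossentropy_sketch(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an illustrative choice
    y_pred = np.clip(y_pred, eps, 1.0)
    # Only the entry where y_true == 1 contributes to the sum
    return -np.sum(y_true * np.log(y_pred), axis=-1)

toy_y_true = np.array([[0.0, 1.0, 0.0]])   # the correct character is index 1
toy_y_pred = np.array([[0.2, 0.7, 0.1]])   # hypothetical softmax output

toy_loss = categorical_crossentropy_sketch(toy_y_true, toy_y_pred)
print(toy_loss, -np.log(0.7))
```

The loss goes to zero as the predicted probability of the correct character approaches one.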

## Training the model and sampling from it


Given a trained model and a text fragment as a seed, we can generate new text with the following steps:

* Get from the model the probability distribution of the next character, given the text generated so far
* Reweight the distribution according to a certain "temperature"
* Randomly sample the next character from the reweighted distribution
* Append the character to the end of the text

The following sampling function reweights the probability distribution returned by the model and draws a character index from it.



In [None]:
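def sample(preds, temperatura=1.0):
    # Work in float64 so the renormalized probabilities sum to 1 accurately
    preds = np.asarray(preds).astype('float64')
    # Scale the log-probabilities: temperatures below 1 sharpen the
    # distribution, temperatures above 1 flatten it
    preds = np.log(preds) / temperatura
    # Exponentiate and renormalize (a softmax over the scaled log-probabilities)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample and return the index of the chosen character
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)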
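
The effect of the temperature is easier to see on a small distribution. The sketch below isolates the reweighting step of `sample`; the helper name `reweight`, the clipping guard against `log(0)` and the toy probabilities are assumptions added for illustration.

```python
import numpy as np

def reweight(preds, temperature=1.0):
    preds = np.asarray(preds, dtype="float64")
    # Clipping avoids log(0); this guard is an addition, not in the original
    logits = np.log(np.clip(preds, 1e-12, None)) / temperature
    # Subtracting the maximum keeps the exponentials numerically stable
    exp_logits = np.exp(logits - logits.max())
    return exp_logits / exp_logits.sum()

toy_preds = [0.5, 0.3, 0.2]  # hypothetical model output over three characters
for temperature in [0.1, 0.5, 1.0, 2.0]:
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

At temperature 1 the distribution is unchanged; as the temperature approaches 0 almost all of the mass moves to the most likely character, and above 1 the distribution moves toward uniform.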

Finally, here is the loop in which we train the model and generate text.

In [None]:
maxep = 20  # range(1, maxep) below runs maxep - 1 = 19 epochs
for epoch in range(1, maxep):
    print('Epoch: ', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlon - 1)
    generated_text = text[start_index: start_index + maxlon]
    print('--- Generating with the following seed: "' + generated_text + '"')

    for temperatura in [0.3]:
        print('------ Temperature:', temperatura)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlon, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperatura)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


## Tasks

* Use your own corpus instead of El Quijote (can be in another language)
* Modify the loop in order to take several different temperatures (between 0.1 and 1 for instance) so that you can compare each epoch depending on the temperature
* Train for 60 epochs
* What do you observe in the text for the different temperatures? Which seems to be the "best" temperature and why?
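
For the second task, one possible sketch of a generation loop over several temperatures is shown below. It restarts from the same seed for every temperature so that the outputs are directly comparable. The `generate` helper, the `rng` argument, the toy vocabulary and `toy_predict` (a stand-in for something like `lambda batch: model.predict(batch, verbose=0)`) are all assumptions made so that the example runs without a trained model.

```python
import numpy as np

def sample(preds, temperatura=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    preds = np.asarray(preds, dtype="float64")
    # Clipping avoids log(0); this guard is an addition to the original function
    logits = np.log(np.clip(preds, 1e-12, None)) / temperatura
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(predict_fn, seed, chars, char_indices, n_chars, temperatura, rng=None):
    maxlon = len(seed)
    window = seed  # local copy: the caller's seed is never modified
    generated = []
    for _ in range(n_chars):
        # One-hot encode the current window of maxlon characters
        sampled = np.zeros((1, maxlon, len(chars)))
        for t, char in enumerate(window):
            sampled[0, t, char_indices[char]] = 1.0
        preds = predict_fn(sampled)[0]
        next_char = chars[sample(preds, temperatura, rng)]
        generated.append(next_char)
        # Slide the window one character forward
        window = window[1:] + next_char
    return "".join(generated)

# Hypothetical stand-ins for the real vocabulary and trained model
toy_chars = ["a", "b", "c"]
toy_indices = {char: index for index, char in enumerate(toy_chars)}

def toy_predict(batch):
    return np.array([[0.6, 0.3, 0.1]])

rng = np.random.default_rng(0)
seed = "abca"
for temperatura in [0.1, 0.3, 0.5, 0.7, 1.0]:
    text_out = generate(toy_predict, seed, toy_chars, toy_indices, 20, temperatura, rng)
    print('------ Temperature:', temperatura)
    print(seed + text_out)
```

Keeping the sliding window local to `generate` is a design choice: it prevents the text produced at one temperature from becoming the seed of the next.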













In [12]:
# Define a new corpus.

In [4]:
# The corpus is assumed to be UTF-8 encoded; stating it explicitly
# avoids platform-dependent decoding
encoding = 'utf-8'

path = keras.utils.get_file(
    'pg1399.txt',
    origin='https://www.gutenberg.org/cache/epub/1399/pg1399.txt')  # Anna Karenina

# Alternative corpus (the Republic):
# 'https://www.gutenberg.org/cache/epub/55201/pg55201.txt'
with open(path, 'r', encoding=encoding) as file:
    text = file.read().lower()

print('Corpus length:', len(text))

Corpus length: 1984055


In [5]:
# Length of extracted character sequences
maxlon = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlon, step):
    sentences.append(text[i: i + maxlon])
    next_chars.append(text[i + maxlon])
print('Number of sentences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: index for index, char in enumerate(chars)}

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
x = np.zeros((len(sentences), maxlon, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sentences: 661332
Unique characters: 76
Vectorization...


In [6]:
def sample(preds, temperatura=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperatura
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [7]:
unitsLSTM = 100
unitsDense = 100

In [8]:
model = keras.models.Sequential()
model.add(layers.LSTM(unitsLSTM, input_shape=(maxlon, len(chars))))
model.add(layers.Dense(units=unitsDense, activation="relu"))
model.add(layers.Dense(units=len(chars), activation="softmax"))




In [9]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 100)               70800     
                                                                 
 dense (Dense)               (None, 100)               10100     
                                                                 
 dense_1 (Dense)             (None, 76)                7676      
                                                                 
Total params: 88576 (346.00 KB)
Trainable params: 88576 (346.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
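
The parameter counts in the summary can be checked by hand. This sketch assumes the standard layout of an LSTM layer (four gates, each with an input kernel, a recurrent kernel and a bias) and of a dense layer (a weight matrix plus a bias); the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

n_chars = 76  # unique characters reported for this corpus
counts = [
    lstm_params(n_chars, 100),   # lstm
    dense_params(100, 100),      # dense
    dense_params(100, n_chars),  # dense_1
]
print(counts, sum(counts))
```

The three values should match the `Param #` column above, and their sum should match the reported total of 88576.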


In [10]:
model.compile(
  optimizer = keras.optimizers.RMSprop(0.01), 
  loss = "categorical_crossentropy",
  metrics = ["accuracy"]
)

In [11]:
# Training

In [None]:
maxep = 61 
for epoch in range(1, maxep):
    print('Epoch: ', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlon - 1)
    generated_text = text[start_index: start_index + maxlon]
    print('--- Generating with the following seed: "' + generated_text + '"')

    for temperatura in [0.1,0.9]:
        print('------ Temperature:', temperatura)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlon, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperatura)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

Epoch:  1


--- Generating with the following seed: "i warned you of
the results in the religious, the civil, and"
------ Temperature: 0.1
i warned you of
the results in the religious, the civil, and the constantly was and the door of the door the fairst of the constantly and the constantly the constantly the constation of the constantes and the constant and the constantly and the door the the constanters of the constanters and the constated to the constation of the constanters and the constanters and the constanters of the propersation and the constation of the constant and the constation of
------ Temperature: 0.9
ion and the constation of the constant and the constation of a voice. he regain for the voice the keasal faind,
and carriager and it still esperstered facter carddly the with it. them, but argunged to like to gling out. “that he orderansky mother, with and out out to
be happeness sereet with hore and are mawons, how both hallow to not under he he would tring letwerse, come 

  preds = np.log(preds) / temperatura


staying the stepan a5y and the staying the staying the staying the stepan a5y the staying the staying the stepan aéy so staying the control—and the stepan aéy the staying the stepan a5y so the staying the staying the stepan a5y the staying the still the stepan aèy the staying the stepan a5y so by the control—and the same the staying the stepan aéy so the sti
------ Temperature: 0.9
e control—and the same the staying the stepan aéy so the stice with for different
was endscan dinner. it was the habe, was a,]bleman. he had hap_. and shan they tabless, be confirmed with the told, who could not be foking at the dlemistty and past very, that the last. a it distinctbock carry.

“she can’t have all, what it is he not do it it’s at the sister from over them admining theorsiving, and taking in the having chan115

she was the countess that he r
Epoch:  15
--- Generating with the following seed: " fait_.”

“this cruelty is something new i did not know in y"
------ Temperature: 0.1
 fait_.”

“this 

#### Comments

As more epochs are run, the value of the loss function computed by the model decreases.

Regarding text generation, a low temperature increases the probability that the most likely characters, according to the distribution computed by the model, are chosen, and shrinks the probability of the remaining characters, so the generated text tends to resemble the training text. However, it also tends to repeat groups of words.

As the temperature increases toward one, the reweighted distribution approaches the one returned by the model. In this case, slightly less probable characters are given more weight, so the variability of the generated words increases, as can be seen in the text produced at temperature 0.9.
The drawback is that, with more options considered when adding a character, it is easier to produce words that do not exist, even though their construction looks more or less typical of the training language. The choice of temperature therefore depends on the accuracy of the model, the degree of coherence we want, and the creativity we are looking for.

In conclusion, to create new text from the chosen book, temperatures 0.3 or 0.9 would be more suitable, with 0.9 being riskier because it can introduce erroneous words. To improve on this, it would be important to find ways to increase the model's accuracy, which, as we can see, levels off at around 60%; this shows in the lack of meaning of the sentences.