<a href="https://colab.research.google.com/github/lacykaltgr/ait-assessments/blob/main/AIT_09_LSTM_text_generation_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Copyright
<pre>
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.

The following source was used when creating this code:
https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py

Copyright (c) 2023 Bálint Gyires-Tóth - All Rights Reserved
</pre>

## Character-based text generation with LSTMs
This notebook shows how to train an LSTM with an arbitrary text corpus, and use the trained model to generate text.

We start with the imports:


In [1]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
from urllib.request import urlretrieve
import numpy as np
import random
import sys
import re
import html  # replaces cgi, whose escape() was removed in Python 3.8

# 1. Dataset acquisition and data preparation
Any text can be used; a larger corpus is expected to result in a better model. Here, we download a text file from gutenberg.org:

In [2]:
url_book="http://www.gutenberg.org/files/2151/2151-0.txt"
urlretrieve(url_book, 'book.txt')
text = open("book.txt", encoding='utf-8').read().lower()

print('Number of characters in the text:', len(text))

Number of characters in the text: 486583


In [3]:
text[:1000]

'\ufeffthe project gutenberg ebook of the works of edgar allan poe, volume 5, by edgar allan poe\n\nthis ebook is for the use of anyone anywhere in the united states and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. you may copy it, give it away or re-use it under the terms\nof the project gutenberg license included with this ebook or online at\nwww.gutenberg.org. if you are not located in the united states, you\nwill have to check the laws of the country where you are located before\nusing this ebook.\n\ntitle: the works of edgar allan poe, volume 5\n\nauthor: edgar allan poe\n\nrelease date: april, 2000 [ebook #2151]\n[most recently updated: january 25, 2023]\n\nlanguage: english\n\ncharacter set encoding: utf-8\n\nproduced by: david widger\nrevised by richard tonsing.\n\n*** start of the project gutenberg ebook the works of edgar allan poe, vol. 5 ***\n\n\n\n\nthe works of edgar allan poe\n\nby edgar allan poe\n\nthe raven edition\n\nvolume v


If the source is an HTML file, strip the HTML tags as well by uncommenting the following lines. Here we downloaded a raw text file, so no tags need to be stripped.

In [None]:
# import html
# tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')
# no_tags = tag_re.sub('', text)
# text = html.escape(no_tags, quote=False)  # cgi.escape was removed in Python 3.8
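
As an illustration of the tag stripping in the commented-out lines above, here is a minimal sketch on a made-up HTML snippet (`sample_html` is hypothetical). It reuses the same regular expression, but it decodes entities with `html.unescape` instead of escaping them; that is an assumption of this sketch about what a character-level corpus needs, not what the notebook does.

```python
import html
import re

# Hypothetical HTML snippet, used only for illustration.
sample_html = "<p>Once upon a <b>midnight</b> dreary &amp; weary<!-- note --></p>"

# Same pattern as in the commented-out cell: HTML comments or any tag.
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

no_tags = tag_re.sub('', sample_html)  # drop comments and tags
plain = html.unescape(no_tags)         # turn entities such as &amp; back into characters

print(plain)
```

A regular expression like this is only a rough filter; for real web pages a proper HTML parser is usually the safer choice.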

We calculate the unique characters of the corpus:

In [5]:
chars = sorted(list(set(text)))
print('Unique characters of the book:', len(chars))

Unique characters of the book: 96


In [6]:
chars

['\n',
 ' ',
 '!',
 '#',
 '$',
 '%',
 '&',
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '=',
 '?',
 '[',
 ']',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '{',
 '}',
 'à',
 'â',
 'æ',
 'è',
 'é',
 'ê',
 'ö',
 'ú',
 'ü',
 'œ',
 'α',
 'γ',
 'δ',
 'ε',
 'η',
 'ι',
 'λ',
 'ν',
 'ξ',
 'ο',
 'π',
 'ρ',
 'ς',
 'σ',
 'τ',
 'υ',
 'χ',
 'ῆ',
 'ῦ',
 '—',
 '‘',
 '’',
 '“',
 '”',
 '•',
 '™',
 '\ufeff']

Next, we create character->index and index->character dictionaries for the one-hot encodings.

In [7]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

print ("Indices to char dictionary:", indices_char)


Indices to char dictionary: {0: '\n', 1: ' ', 2: '!', 3: '#', 4: '$', 5: '%', 6: '&', 7: '(', 8: ')', 9: '*', 10: ',', 11: '-', 12: '.', 13: '/', 14: '0', 15: '1', 16: '2', 17: '3', 18: '4', 19: '5', 20: '6', 21: '7', 22: '8', 23: '9', 24: ':', 25: ';', 26: '=', 27: '?', 28: '[', 29: ']', 30: '_', 31: 'a', 32: 'b', 33: 'c', 34: 'd', 35: 'e', 36: 'f', 37: 'g', 38: 'h', 39: 'i', 40: 'j', 41: 'k', 42: 'l', 43: 'm', 44: 'n', 45: 'o', 46: 'p', 47: 'q', 48: 'r', 49: 's', 50: 't', 51: 'u', 52: 'v', 53: 'w', 54: 'x', 55: 'y', 56: 'z', 57: '{', 58: '}', 59: 'à', 60: 'â', 61: 'æ', 62: 'è', 63: 'é', 64: 'ê', 65: 'ö', 66: 'ú', 67: 'ü', 68: 'œ', 69: 'α', 70: 'γ', 71: 'δ', 72: 'ε', 73: 'η', 74: 'ι', 75: 'λ', 76: 'ν', 77: 'ξ', 78: 'ο', 79: 'π', 80: 'ρ', 81: 'ς', 82: 'σ', 83: 'τ', 84: 'υ', 85: 'χ', 86: 'ῆ', 87: 'ῦ', 88: '—', 89: '‘', 90: '’', 91: '“', 92: '”', 93: '•', 94: '™', 95: '\ufeff'}
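
To make the role of the two dictionaries concrete, the following sketch builds them for a tiny made-up corpus and round-trips a word through them. All `toy_*` names are hypothetical and exist only in this example.

```python
# Toy corpus (hypothetical), standing in for the book text.
toy_text = "hello world"
toy_chars = sorted(set(toy_text))

# Same construction as above, written as dict comprehensions.
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

encoded = [toy_char_indices[c] for c in "hello"]          # characters -> integers
decoded = "".join(toy_indices_char[i] for i in encoded)  # integers -> characters

print(toy_chars)
print(encoded)
print(decoded)
```

Because the character list is sorted, the mapping is deterministic for a given corpus, which matters when a trained model is reused later.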


## 1.1. Creating 3D input data for the LSTM - exercise
Split the text into 40-character sequences taken every 10 characters, so consecutive sequences overlap by 30 characters; each sequence is the input and the character that follows it is the output. We call these sequences "sentences", although they are chunks of text rather than grammatical sentences.

In [8]:
maxlen  = 40
step    = 10   # the step size between the starts of two consecutive "sentences" is 10 characters
sentences  = [] # chunks of maxlen characters; consecutive chunks overlap by maxlen - step characters
next_chars = [] # the next character

Cut out sequences and the corresponding next characters from the corpus, where the sequence length is "maxlen", and the step size between two instances is "step".

In [9]:
for i in range(0, len(text)-maxlen, step):
    sentences.append(text[i:i+maxlen])
    next_chars.append(text[i+maxlen])

In [10]:
print('Number of training samples:', len(sentences)) # it should be 48655

Number of training samples: 48655
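
The sliding window is easier to see on a short string. The sketch below wraps the same loop in a hypothetical helper, `make_windows`, and runs it with a small `maxlen` and `step`; it also checks the sample count reported above using the corpus length printed earlier.

```python
def make_windows(text, maxlen, step):
    """Cut `text` into windows of `maxlen` characters, one every `step` characters,
    and pair each window with the character that follows it."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets

toy_text = "abcdefghijklmnop"  # 16 characters, hypothetical corpus
windows, targets = make_windows(toy_text, maxlen=6, step=2)
for w, t in zip(windows, targets):
    print(repr(w), '->', repr(t))

# Consecutive windows share maxlen - step characters (here 6 - 2 = 4).
# The number of windows is ceil((len(text) - maxlen) / step);
# for the book: ceil((486583 - 40) / 10).
expected_samples = -(-(486583 - 40) // 10)
print(expected_samples)
```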


Creating NumPy arrays with the correct shapes:

In [11]:
X = np.zeros((len(sentences), maxlen, len(chars)))
y = np.zeros((len(sentences), len(chars)))

Before filling the arrays, we inspect two consecutive samples; then we write the one-hot encodings into the NumPy arrays:

In [12]:
sentences[139], char_indices[next_chars[139]]

('s s. osgood\n eldorado\n to marie louise (', 49)

In [13]:
sentences[140]

'd\n eldorado\n to marie louise (shew)\n o m'

In [14]:
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence): 
        X[i,t,char_indices[char]] = 1
    y[i,char_indices[next_chars[i]]] = 1

print ("Shape of the input data:", X.shape)
print ("Shape of the target data:", y.shape)

Shape of the input data: (48655, 40, 96)
Shape of the target data: (48655, 96)
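
The nested loop above is easy to read but slow for large corpora. As an alternative sketch (not the notebook's method), the same one-hot tensors can be built by indexing an identity matrix with integer ids. The `toy_*` names and the tiny data are hypothetical.

```python
import numpy as np

# Hypothetical toy data: two "sentences" of length 3 and their next characters.
toy_sentences = ["abc", "bca"]
toy_next = ["a", "b"]
toy_chars = sorted(set("".join(toy_sentences)))
toy_idx = {c: i for i, c in enumerate(toy_chars)}

# Integer ids: shape (samples, maxlen) for inputs, (samples,) for targets.
ids = np.array([[toy_idx[c] for c in s] for s in toy_sentences])
next_ids = np.array([toy_idx[c] for c in toy_next])

# Row k of the identity matrix is the one-hot vector of class k,
# so fancy indexing turns ids into one-hot encodings in one step.
eye = np.eye(len(toy_chars), dtype=np.float32)
X_toy = eye[ids]        # shape (2, 3, 3)
y_toy = eye[next_ids]   # shape (2, 3)

print(X_toy.shape, y_toy.shape)
```

Using `float32` (or `bool`) instead of NumPy's default `float64` also reduces the memory footprint of the full `(48655, 40, 96)` array.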


In [15]:
for char in "hello":
  print(char_indices[char])

38
35
42
42
45


# 2. Model definition
We define a simple LSTM model:

In [16]:
model = Sequential()
model.add(LSTM(128, input_shape=(X.shape[-2], X.shape[-1]))) # (batch, 128)
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

In [17]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 128)               115200    
                                                                 
 dense (Dense)               (None, 96)                12384     
                                                                 
 activation (Activation)     (None, 96)                0         
                                                                 
Total params: 127,584
Trainable params: 127,584
Non-trainable params: 0
_________________________________________________________________
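
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout with biases enabled (four gates, each with an input kernel, a recurrent kernel, and a bias vector); the helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with: input kernel (input_dim x units),
    # recurrent kernel (units x units), and bias (units).
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units

lstm_params = lstm_param_count(input_dim=96, units=128)
dense_params = dense_param_count(input_dim=128, units=96)

print(lstm_params, dense_params, lstm_params + dense_params)
```

Note that the sequence length (40) does not appear in the formula: the same weights are reused at every time step.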


Compiling the model:

In [18]:
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
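
For a one-hot target, categorical cross-entropy reduces to the negative log-probability that the model assigns to the correct character. The NumPy sketch below illustrates the formula; the clipping constant `eps` is an assumption of this sketch, and the function is not Keras's implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an assumed value for this sketch.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([[0.0, 0.0, 1.0, 0.0, 0.0]])       # the correct class is index 2
y_pred = np.array([[0.1, 0.2, 0.3, 0.15, 0.25]])     # predicted distribution
loss = categorical_crossentropy(y_true, y_pred)       # equals -log(0.3)

# A model that guesses uniformly over the 96 characters scores log(96),
# a useful baseline when reading the training loss.
uniform_baseline = np.log(96)

print(loss[0], uniform_baseline)
```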

# 3. Training and evaluation
In this part we perform training and evaluation together. Because this is a generative model, it is hard to evaluate automatically, so we simply generate text from an input prompt between rounds of training and inspect it.

## 3.1. Sampling functions for evaluation

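We sample the next character from the predicted distribution. The temperature controls how strongly the sampling favors the most probable character: values below 1 sharpen the distribution, values above 1 flatten it.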

In [19]:
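def sample(preds, temperature=1.0):
    # Rescale a probability vector with the temperature and draw one index from it.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature  # temperature < 1 sharpens, > 1 flattens
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)  # renormalize so the values sum to 1
    probas = np.random.multinomial(1, preds, 1)  # one draw, returned as a one-hot row
    return np.argmax(probas), preds  # sampled index and the rescaled distribution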


Testing the sample function:

In [20]:
fake_preds=[0.1, 0.2, 0.3, 0.15, 0.25] # a toy distribution; the values sum to 1
for temp in [0.1, 0.5, 1, 2, 4]:
    print(fake_preds)
    index, preds = sample(fake_preds,temp) # sampled index and rescaled distribution
    print(preds)
    print(index)

[0.1, 0.2, 0.3, 0.15, 0.25]
[1.43537082e-05 1.46981972e-02 8.47572114e-01 8.27707142e-04
 1.36887628e-01]
2
[0.1, 0.2, 0.3, 0.15, 0.25]
[0.04444444 0.17777778 0.4        0.1        0.27777778]
4
[0.1, 0.2, 0.3, 0.15, 0.25]
[0.1  0.2  0.3  0.15 0.25]
0
[0.1, 0.2, 0.3, 0.15, 0.25]
[0.14384043 0.20342109 0.24913894 0.17616783 0.2274317 ]
2
[0.1, 0.2, 0.3, 0.15, 0.25]
[0.17037527 0.20261148 0.22422646 0.18855123 0.21423556]
2
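
Dividing the log-probabilities by the temperature T and renormalizing is the same as raising each probability to the power 1/T and renormalizing, which explains the outputs above (for T = 0.5 the probabilities are squared). The sketch below shows a numerically safer variant of the rescaling step; the function name, the clipping constant `eps`, and the max-subtraction are additions of this sketch, not part of the notebook's `sample` function.

```python
import numpy as np

def apply_temperature(probs, temperature=1.0, eps=1e-12):
    """Return the temperature-rescaled distribution (no sampling)."""
    probs = np.asarray(probs, dtype=np.float64)
    # Clip so that zero probabilities do not produce log(0) = -inf warnings.
    logits = np.log(np.clip(probs, eps, None)) / temperature
    logits -= logits.max()  # shift for numerical stability; cancels after normalization
    scaled = np.exp(logits)
    return scaled / scaled.sum()

p = [0.1, 0.2, 0.3, 0.15, 0.25]
print(apply_temperature(p, 0.5))                 # sharper than p
print(apply_temperature(p, 1.0))                 # unchanged
print(apply_temperature([0.0, 0.5, 0.5], 0.5))   # zero entries stay near zero
```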


## 3.2. Training and text generation
The following code block trains for 10 epochs, generates text at several temperatures, and then repeats this cycle (9 iterations in total).

In [21]:
start_index = random.randint(0, len(text) - maxlen - 1) # random starting point
for iteration in range(1, 10):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, epochs=10)
    
    for temp in [0.4, 1.0, 1.2]: # changing the "temperature"
        print()
        print('----- temperature:', temp)
        generated = ''
        sentence = text[start_index: start_index + maxlen] 
        generated += sentence
        print('----- Generating with initial text: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(200): # we generate 200 characters
            # creating the one-hot encoded input for the LSTM
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1
            preds = model.predict(x, verbose=0)[0] # forward pass
            next_index,_ = sample(preds, temp) # sampling the predictions with "temperature"
            next_char = indices_char[next_index] # converting the prediction to character

            generated += next_char
            sentence = sentence[1:] + next_char # we add the generated character to the input and delete the first character to keep it "maxlen" long

            sys.stdout.write(next_char) # we print the character
            sys.stdout.flush()
       
        preds=next_index=next_char=generated=sentence=""

        print()



--------------------------------------------------
Iteration 1
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

----- temperature: 0.4
----- Generating with initial text: "n lately of thy wedding.
  how fares goo"
n lately of thy wedding.
  how fares good and the worldsh were from the
      mr. poets, shadow for the the present
      means of the speriests were amony heart
      the stars of the saves of the early
      the speaked from the early wor

----- temperature: 1.0
----- Generating with initial text: "n lately of thy wedding.
  how fares goo"
n lately of thy wedding.
  how fares goods” the scill a porred and paetle
_heje.

3 croust should damb’s your thiseligatement, charbercess
come and monum when!
  the warks
     a pashesm! at wav, intended gath finculled
      upon theense d

----- temperature: 1.2
----- Generating with initial text: "n lately of thy wedding.
  how fares goo"
n lately of thy wedding.
  how far

KeyboardInterrupt: ignored
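
The generation loop above can also be factored into a reusable function. The following is a minimal sketch under stated assumptions: `generate_text` and its parameters are hypothetical names, and a stub predictor that returns a uniform distribution stands in for the trained model so that the example runs without TensorFlow.

```python
import numpy as np

def generate_text(predict_fn, seed, char_indices, indices_char,
                  length, temperature, rng):
    """Generate `length` characters after `seed` with a sliding input window."""
    window, out = seed, seed
    for _ in range(length):
        # One-hot encode the current window: shape (1, len(window), vocabulary size).
        x = np.zeros((1, len(window), len(char_indices)))
        for t, ch in enumerate(window):
            x[0, t, char_indices[ch]] = 1
        probs = np.asarray(predict_fn(x)[0], dtype=np.float64)
        # Temperature rescaling, with clipping to avoid log(0).
        logits = np.log(np.clip(probs, 1e-12, None)) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_char = indices_char[int(rng.choice(len(probs), p=probs))]
        out += next_char
        window = window[1:] + next_char  # keep the window length constant
    return out

# Stub predictor (hypothetical): uniform distribution over a 3-character vocabulary.
toy_chars = ['a', 'b', 'c']
ci = {c: i for i, c in enumerate(toy_chars)}
ic = {i: c for i, c in enumerate(toy_chars)}

def uniform_predict(x):
    return np.full((x.shape[0], len(toy_chars)), 1.0 / len(toy_chars))

rng = np.random.default_rng(0)
result = generate_text(uniform_predict, "abca", ci, ic,
                       length=10, temperature=1.0, rng=rng)
print(result)
```

With the trained model, the predictor could be passed as `lambda x: model.predict(x, verbose=0)`, together with `char_indices`, `indices_char`, and a 40-character seed taken from the corpus.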