
# Study Project - Text Generation

### Our project plan
0. Import the required libraries
1. Fetch the text data
2. Create a mapping from unique characters to integers
3. Transform the dataset into patterns of a specified length with a label-value mapping 
4. Reshape the inputs to contain the samples, the time step count and features
5. Normalize the inputs
6. One hot encode the output variables
7. Define the LSTM model 
8. Define our checkpoints
9. Fit the model (go get a coffee)
10. Use the model to generate some output


## Import the required libraries
- numpy
- Sequential from tensorflow.keras.models
- Dense, Dropout and LSTM from tensorflow.keras.layers
- ModelCheckpoint from tensorflow.keras.callbacks
- tensorflow.keras.utils as utils


In [6]:
import numpy 
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow.keras.utils as utils

## Uploading the data
- Download your book of choice as a .txt file
- Edit out the Gutenberg preamble and table of contents
- Upload the resulting file here

## Get the data into the notebook
1. Create a variable `filename` to hold the name of your text file.
2. Use `open()` with the filename, mode 'r' (read), and encoding utf-8 combined with `read()` to get the raw text. Save the raw text in a variable.
3. Transform the raw text into lowercase.
4. Print the character length of the text.  
5. BONUS: Replace rare characters (á, é, ï and similar) with more common ones (a, e, i).

In [32]:
filename = "frankenstein.txt"
# the with-block closes the file as soon as the text has been read
with open(filename, 'r', encoding="utf-8") as text_file:
  raw_text = text_file.read()
raw_text = raw_text.lower()
print("Total count of characters: ", len(raw_text))

Total count of characters:  437493
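
For the BONUS step, one possible approach is sketched below. It is an assumption rather than part of the worksheet: it decomposes accented characters with Python's standard `unicodedata` module and drops the combining marks. The helper name `strip_accents` is hypothetical. Characters that have no Unicode decomposition, such as æ, pass through unchanged and would need an explicit replacement.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits "é" into "e" plus a separate combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # keep everything except the combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = "café naïve"
print(strip_accents(sample))  # → cafe naive
```

If you use it, apply it right after lowercasing, for example `raw_text = strip_accents(raw_text)`, so that the vocabulary built later is smaller.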


### Truncate the input (optional)
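For testing purposes you might want to shorten the text so that training runs faster. Pick a limit below the total character count printed above, otherwise the slice changes nothing.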

In [8]:
#raw_text = raw_text[:500000]

Print the first 1000 characters of your text.

In [33]:
print(raw_text[:1000])


letter 1

_to mrs. saville, england._


st. petersburgh, dec. 11th, 17—.


you will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. i arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

i am already far north of london, and as i walk in the streets of
petersburgh, i feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. do you understand this
feeling? this breeze, which has travelled from the regions towards
which i am advancing, gives me a foretaste of those icy climes.
inspirited by this wind of promise, my daydreams become more fervent
and vivid. i try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight. there, margaret, the sun is for ever
visible, its broad dis

Create a variable `vocabulary` to hold the sorted unique characters in the text, then print how many there are.

In [34]:
vocabulary = sorted(set(raw_text))
print(len(vocabulary))

66


## Create a mapping from the unique characters to integers
1. Use a dict comprehension and `enumerate` to create a dictionary `character_to_integer` where the key is the character and the value is an integer (its index in `vocabulary`).
2. Print the resulting dictionary (you can import and use pprint to make it more readable)

In [35]:
from pprint import pprint
character_to_integer = {c: i for i, c in enumerate(vocabulary)}
pprint(character_to_integer)

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 '$': 4,
 '%': 5,
 "'": 6,
 '(': 7,
 ')': 8,
 '*': 9,
 ',': 10,
 '-': 11,
 '.': 12,
 '/': 13,
 '0': 14,
 '1': 15,
 '2': 16,
 '3': 17,
 '4': 18,
 '5': 19,
 '6': 20,
 '7': 21,
 '8': 22,
 '9': 23,
 ':': 24,
 ';': 25,
 '?': 26,
 '[': 27,
 ']': 28,
 '_': 29,
 'a': 30,
 'b': 31,
 'c': 32,
 'd': 33,
 'e': 34,
 'f': 35,
 'g': 36,
 'h': 37,
 'i': 38,
 'j': 39,
 'k': 40,
 'l': 41,
 'm': 42,
 'n': 43,
 'o': 44,
 'p': 45,
 'q': 46,
 'r': 47,
 's': 48,
 't': 49,
 'u': 50,
 'v': 51,
 'w': 52,
 'x': 53,
 'y': 54,
 'z': 55,
 'æ': 56,
 'è': 57,
 'é': 58,
 'ê': 59,
 'ô': 60,
 '—': 61,
 '‘': 62,
 '’': 63,
 '“': 64,
 '”': 65}


## Transforming the data into inputs and outputs
Let's summarize our data by checking:
- how many characters the text contains in total
- how many unique characters there are

Save both counts in variables `n_characters` and `n_vocabulary`.

In [36]:
n_characters = len(raw_text)
n_vocabulary = len(vocabulary)
print("Total characters: ", n_characters)
print("Total vocabulary: ", n_vocabulary)

Total characters:  437493
Total vocabulary:  66


Creating the inputs and outputs:
1. Define a variable `sequence_length` and give it the value 100
2. Define empty lists dataX and dataY
3. Loop over the range `n_characters - sequence_length` using the iterator i
  - Define a variable `sequence_input` which is the text from index i until index i + `sequence_length`
  - Define a variable `sequence_output` which is the character in the text at position `i + sequence_length`
  - Transform `sequence_input` to integers and append it to `dataX`
  - Transform `sequence_output` to an integer and append it to `dataY`
4. Print the length of dataX

In [37]:
sequence_length = 100
dataX = []
dataY = []
for i in range(n_characters - sequence_length):
  sequence_input = raw_text[i:i + sequence_length]
  sequence_output = raw_text[i + sequence_length]
  dataX.append([character_to_integer[char] for char in sequence_input])
  dataY.append(character_to_integer[sequence_output])

n_patterns = len(dataX)
print("Total patterns: ", n_patterns)

Total patterns:  437393
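
To make the sliding window easier to picture, here is a minimal sketch of the same loop on a five-character toy string. All `toy_` names are made up for illustration and the window length of 3 is arbitrary.

```python
toy_text = "hello"
toy_vocabulary = sorted(set(toy_text))          # ['e', 'h', 'l', 'o']
toy_mapping = {c: i for i, c in enumerate(toy_vocabulary)}
toy_sequence_length = 3

toy_X, toy_y = [], []
for i in range(len(toy_text) - toy_sequence_length):
    window = toy_text[i:i + toy_sequence_length]   # input characters
    target = toy_text[i + toy_sequence_length]     # the character that follows
    toy_X.append([toy_mapping[c] for c in window])
    toy_y.append(toy_mapping[target])
    print(repr(window), "->", repr(target))

print(toy_X)  # → [[1, 0, 2], [0, 2, 2]]
print(toy_y)  # → [2, 3]
```

The number of patterns is always the text length minus the window length, which is why the full text gives 437493 - 100 = 437393 patterns.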


## Reshape our inputs 
Create a new variable X that will hold the result of `numpy.reshape` on dataX with the new shape `(length of dataX, length of a sequence, features (1))`.

In [38]:
X = numpy.reshape(dataX, (n_patterns, sequence_length, 1))

## Normalize the inputs
Update X to be X divided by our vocabulary count.

In [39]:
X = X / float(n_vocabulary)
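
The reshape and normalization steps can be checked on a tiny example. This is only a sketch with made-up values (two patterns of length 3 and a vocabulary of 4 characters), not the real dataset.

```python
import numpy

toy_patterns = [[1, 0, 2], [0, 2, 2]]   # two integer-encoded patterns
toy_vocabulary_size = 4                 # assumed vocabulary size for the toy data

# (samples, time steps, features): one feature per time step
X_toy = numpy.reshape(toy_patterns, (len(toy_patterns), 3, 1))
# dividing by the vocabulary size scales every value into the range [0, 1)
X_toy = X_toy / float(toy_vocabulary_size)

print(X_toy.shape)  # → (2, 3, 1)
```

Because the largest character index is always one less than the vocabulary size, the normalized inputs never reach 1.0.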

## One hot encode the outputs
Define a new variable y that contains dataY one hot encoded using `utils.to_categorical`.



In [40]:
y = utils.to_categorical(dataY)
print(y[0])

[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
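
To see what one hot encoding produces without loading TensorFlow, here is a NumPy-only sketch. It is an illustration of the idea, not a replacement for `utils.to_categorical`, and the labels and class count are made up.

```python
import numpy

labels = [2, 0, 3]   # integer-encoded target characters (toy values)
n_classes = 4        # assumed number of distinct characters

# row k of the identity matrix is the one hot vector for class k
one_hot = numpy.eye(n_classes)[labels]
print(one_hot.shape)  # → (3, 4)
```

Each row contains a single 1 at the position of its label. As far as I know, `to_categorical` infers the number of classes from the largest label when `num_classes` is not given, so passing `num_classes=n_vocabulary` is a safe choice if you truncate the text.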


## Designing the model
1. Define a variable model that is an empty `Sequential`
2. Add an LSTM layer with 256 units, and the input shape `(X.shape[1], X.shape[2])` 
3. Add a dropout layer with dropout rate .2
4. Add a dense layer with `y.shape[1]` units and a softmax activation
5. Compile the model using `categorical_crossentropy` as the loss and an adam optimizer
6. Print the model summary
7. BONUS: Add an extra LSTM and Dropout layer for improved results (this takes longer to train). In this case the second LSTM layer only needs the number of memory units, and you need to add `return_sequences=True` to the first LSTM layer so that it passes its full output sequence on to the second one.



In [41]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [42]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 100, 256)          264192    
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 256)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 66)                16962     
Total params: 806,466
Trainable params: 806,466
Non-trainable params: 0
_________________________________________________________________
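
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for LSTM and dense layers. The helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(units, input_dim):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    # one weight per input per unit, plus one bias per unit
    return input_dim * units + units

first_lstm = lstm_params(256, 1)        # input is 1 feature per time step
second_lstm = lstm_params(256, 256)     # input is the first LSTM's 256 outputs
output_layer = dense_params(66, 256)    # 66 characters in the vocabulary
total = first_lstm + second_lstm + output_layer

print(first_lstm, second_lstm, output_layer, total)
# → 264192 525312 16962 806466
```

Dropout layers have no weights, which is why they show 0 parameters.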


## Define checkpoints
1. Define a variable `filepath` for an HDF5 file whose name contains the epoch and loss, e.g. `"weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"`
2. Define a variable `checkpoint` to hold an instance of ModelCheckpoint for that filepath, where monitor='loss', save_best_only=True and the mode='min'
3. Define a variable `callbacks_list` that only contains the above defined checkpoint

In [43]:
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor="loss", verbose=1, save_best_only=True, mode="min")
callbacks_list = [checkpoint]
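
The placeholders in `filepath` are filled in each time a checkpoint is saved. You can preview the resulting file names with plain `str.format`. The epoch and loss values below are example numbers chosen for illustration.

```python
template = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# {epoch:02d} pads the epoch to two digits, {loss:.4f} keeps four decimals
print(template.format(epoch=5, loss=1.90454))
# → weights-improvement-05-1.9045.hdf5
```

Since `save_best_only=True` is set, a new file should only appear for epochs where the monitored loss improves.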

## Training our model
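- Call `model.fit` for X and y with 20 epochs, a batch size of 128 and the callbacks list you defined above.
- BONUS: If you want to, try decreasing the batch size to something like 64 and increasing the epoch count to something like 50. Smaller batches mean more weight updates per epoch and more epochs mean more passes over the text, which can yield better results at the cost of longer training. The cell below uses these bonus settings.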
- Go get some coffee! ☕️☕️☕️

In [None]:
model.fit(X, y, epochs=50, batch_size=64, callbacks=callbacks_list)

Epoch 1/50

Epoch 00001: loss did not improve from 2.03973
Epoch 2/50

Epoch 00002: loss improved from 2.03973 to 2.02024, saving model to weights-improvement-02-2.0202.hdf5
Epoch 3/50

Epoch 00003: loss did not improve from 2.02024
Epoch 4/50

Epoch 00004: loss improved from 2.02024 to 2.01007, saving model to weights-improvement-04-2.0101.hdf5
Epoch 5/50

Epoch 00005: loss improved from 2.01007 to 1.90454, saving model to weights-improvement-05-1.9045.hdf5
Epoch 6/50

Epoch 00006: loss improved from 1.90454 to 1.88365, saving model to weights-improvement-06-1.8837.hdf5
Epoch 7/50

Epoch 00007: loss improved from 1.88365 to 1.85774, saving model to weights-improvement-07-1.8577.hdf5
Epoch 8/50

Epoch 00008: loss improved from 1.85774 to 1.84010, saving model to weights-improvement-08-1.8401.hdf5
Epoch 9/50

Epoch 00009: loss improved from 1.84010 to 1.82382, saving model to weights-improvement-09-1.8238.hdf5
Epoch 10/50

Epoch 00010: loss improved from 1.82382 to 1.80722, saving model

## Using the model

### Load the model with optimal weights
1. Define a variable `filename` and assign it the name of your lowest loss checkpoint file
2. Use `model.load_weights` to load the weights from said file
3. Compile the model as we did before (categorical crossentropy and adam)


In [45]:
filename = "weights-improvement-18-2.0397.hdf5"
model.load_weights(filename)
model.compile(loss="categorical_crossentropy", optimizer="adam")
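
If you have many checkpoint files, you can let Python pick the one with the lowest loss instead of reading the names by eye. The sketch below assumes the file naming pattern defined earlier. The helper `lowest_loss_checkpoint` and the example file list are hypothetical.

```python
import re

def lowest_loss_checkpoint(filenames):
    # capture the epoch and the loss from names like weights-improvement-05-1.9045.hdf5
    pattern = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")
    scored = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    if not scored:
        return None
    # the smallest loss wins
    return min(scored)[1]

example_files = [
    "weights-improvement-02-2.0202.hdf5",
    "weights-improvement-05-1.9045.hdf5",
    "weights-improvement-04-2.0101.hdf5",
]
print(lowest_loss_checkpoint(example_files))
# → weights-improvement-05-1.9045.hdf5
```

In the notebook you could pass it the result of `glob.glob("weights-improvement-*.hdf5")` and assign the return value to `filename`.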

### Create a reverse mapping from ints to chars
Use the same strategy you had when creating the `character_to_integer` dictionary to create a dictionary `integers_to_characters` where the integers are the keys and the characters are the values.

In [46]:
integers_to_characters = {i: c for i, c in enumerate(vocabulary)}
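
A quick way to check that the two mappings agree is to encode a string and decode it again. This sketch uses a small made-up vocabulary, and the `toy_` names are for illustration only.

```python
toy_vocabulary = sorted(set("frankenstein"))
toy_char_to_int = {c: i for i, c in enumerate(toy_vocabulary)}
toy_int_to_char = {i: c for i, c in enumerate(toy_vocabulary)}

encoded = [toy_char_to_int[c] for c in "stein"]
decoded = "".join(toy_int_to_char[i] for i in encoded)

print(encoded)
print(decoded)  # → stein
```

If the decoded text differs from the original, the two dictionaries were built from different vocabularies.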

### Generate some text
1. Import the sys module
2. Pick a random pattern from dataX to act as the seed and save it in a variable
3. Print the pattern as characters using `integers_to_characters`
4. Loop once per character you want to generate (the cell below generates 100; try 600 for a longer sample)
  - Create a variable x that holds the `numpy.reshape`d pattern with the new shape `(1, length of pattern, 1)`
  - Normalize x over `n_vocabulary`
  - Use `model.predict` to make a prediction for x and save it in a variable
  - Use `numpy.argmax` to choose the most probable result from the prediction and save it in a variable `index`
  - Create a variable `result` containing the corresponding character for `index`
  - Write that character to stdout using `sys.stdout.write`
  - Append `index` to the pattern
  - Drop the first character of the pattern so that it keeps its original length

In [47]:
import sys

In [None]:
dataX[0]

In [52]:
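# randint's upper bound is exclusive, so this can pick any pattern
start = numpy.random.randint(0, len(dataX))
# copy the seed so that appending below does not modify dataX
pattern = list(dataX[start])
print("seed:")
print(''.join([integers_to_characters[value] for value in pattern]))
print("prediction:")

for i in range(100):
  # shape (1, sequence_length, 1), scaled the same way as the training inputs
  x = numpy.reshape(pattern, (1, len(pattern), 1))
  x = x / float(n_vocabulary)
  prediction = model.predict(x, verbose=0)
  index = int(numpy.argmax(prediction))
  result = integers_to_characters[index]
  sys.stdout.write(result)
  # slide the window: add the prediction, drop the oldest character
  pattern.append(index)
  pattern = pattern[1:]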

seed:
xistence and of its unspeakable torments, dared to hope for
happiness, that while he accumulated wre
prediction:
tched the sass that i had been the same oooy and the sass oo the seman of the semeer of the same of 
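
The window update in the generation loop (append the prediction, drop the oldest value) can be tried out without a trained model. In this sketch `toy_predict` is a stand-in for `model.predict` plus `argmax`, and both function names are hypothetical.

```python
def generate(seed, predict_next, n_steps):
    window = list(seed)   # copy so the seed itself is not modified
    generated = []
    for _ in range(n_steps):
        next_value = predict_next(window)
        generated.append(next_value)
        window.append(next_value)   # add the prediction at the end
        window = window[1:]         # drop the oldest value at the front
    return generated, window

def toy_predict(window):
    # stand-in for the model: the last value plus one, wrapping around at 5
    return (window[-1] + 1) % 5

seed = [0, 1, 2]
generated, final_window = generate(seed, toy_predict, 4)
print(generated)     # → [3, 4, 0, 1]
print(final_window)  # → [4, 0, 1]
print(seed)          # → [0, 1, 2]
```

The window always keeps the length of the seed, which is what the model expects as input.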