<a href="https://colab.research.google.com/github/mgorkemuysal/TextGenerationFrankenstein/blob/master/text_generation_frankenstein.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Downloading and Preparing the Data**
I downloaded the data from the Project Gutenberg website (https://www.gutenberg.org/browse/scores/top) and uploaded it to my GitHub repository so it is easy to fetch and use. After downloading, to make the data usable for processing, we read the text in lowercase and store the whole book as a single string.  

In [0]:
!git clone https://github.com/mgorkemuysal/TextGenerationFrankenstein.git
# Nothing is so painful to the human mind as a great and sudden change 
import string
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM 
from keras.optimizers import RMSprop
import random as rd 
import sys
np.seterr(divide = 'ignore')

Cloning into 'TextGenerationFrankenstein'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 7 (delta 1), reused 0 (delta 0), pack-reused 0

Using TensorFlow backend.


{'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}

# **Inspecting and Cleaning the Data**
The book contains 440748 characters, including punctuation and special characters. To keep the vocabulary small and simplify modeling, we remove the punctuation and special characters.

In [0]:
with open('./TextGenerationFrankenstein/frankenstein.txt', 'rt', encoding = 'utf-8') as f:
  raw_text = f.read().lower()
print('Corpus length:', len(raw_text))

Corpus length: 440748


In [0]:
print(raw_text[:5000])

﻿
project gutenberg's frankenstein, by mary wollstonecraft (godwin) shelley

this ebook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  you may copy it, give it away or
re-use it under the terms of the project gutenberg license included
with this ebook or online at www.gutenberg.net


title: frankenstein
       or the modern prometheus

author: mary wollstonecraft (godwin) shelley

release date: june 17, 2008 [ebook #84]
last updated: january 13, 2018

language: english

character set encoding: utf-8

*** start of this project gutenberg ebook frankenstein ***




produced by judith boss, christy phillips, lynn hanninen,
and david meltzer. html version by al haines.
further corrections by menno de leeuw.



frankenstein;


or, the modern prometheus




by


mary wollstonecraft (godwin) shelley






contents




letter 1

letter 2

letter 3

letter 4

chapter 1

chapter 2

chapter 3

chapter 4

chapter 5

chapter 6

chapter 7

chapter 8

chapter

In [0]:
punc_list = list(string.punctuation) + ['“', '”', 'æ', 'è', 'é', 'ê', 'ô', '—', '‘', '’', '\ufeff']
print(punc_list)
 
def remove_punctuations(txt):
  for c in punc_list:
    txt = txt.replace(c, '')
  return txt

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '“', '”', 'æ', 'è', 'é', 'ê', 'ô', '—', '‘', '’', '\ufeff']
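The same cleaning can be done in a single pass over the text. The sketch below is one possible alternative, assuming the same set of unwanted characters as `punc_list`; the names `extra_chars`, `delete_table` and `remove_punctuations_fast` are introduced here only for illustration.

```python
import string

# Assumed to mirror the notebook's punc_list: ASCII punctuation plus the
# extra characters found in this particular corpus.
extra_chars = ['“', '”', 'æ', 'è', 'é', 'ê', 'ô', '—', '‘', '’', '\ufeff']

# The third argument of str.maketrans lists the characters to delete.
delete_table = str.maketrans('', '', string.punctuation + ''.join(extra_chars))

def remove_punctuations_fast(txt):
    # str.translate drops every character in the table in one pass.
    return txt.translate(delete_table)

sample_line = "re-use it, under the terms; of the “project”."
print(remove_punctuations_fast(sample_line))
```

On a book-sized string this avoids building one intermediate copy of the text per removed character, which is what repeated `replace` calls do.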


In [0]:
text = remove_punctuations(raw_text)
print(text[:5000])


project gutenbergs frankenstein by mary wollstonecraft godwin shelley

this ebook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever  you may copy it give it away or
reuse it under the terms of the project gutenberg license included
with this ebook or online at wwwgutenbergnet


title frankenstein
       or the modern prometheus

author mary wollstonecraft godwin shelley

release date june 17 2008 ebook 84
last updated january 13 2018

language english

character set encoding utf8

 start of this project gutenberg ebook frankenstein 




produced by judith boss christy phillips lynn hanninen
and david meltzer html version by al haines
further corrections by menno de leeuw



frankenstein


or the modern prometheus




by


mary wollstonecraft godwin shelley






contents




letter 1

letter 2

letter 3

letter 4

chapter 1

chapter 2

chapter 3

chapter 4

chapter 5

chapter 6

chapter 7

chapter 8

chapter 9

chapter 10

chapter 11

chapter 12



# **Vectorizing Sequences of Characters**
Let's split the book text into subsequences with a fixed length of 60 characters. Each training pattern consists of 60 time steps of one character each (x), followed by a single output character (y). We slide this 60-character window along the whole book, so the character after each window is learned from the 60 characters that precede it.

We also define a fixed step of 3. The step is the stride of the window: a new sequence starts every 3 characters, so consecutive sequences overlap by 57 characters. As a smaller illustration, with a window of 3 and a step of 1 on the word "example", the patterns would be: exa --> m, xam --> p, amp --> l, mpl --> e.

We also have to convert the characters into integers, because the model cannot be trained on raw characters. Since we want character-level text generation, these character indices serve as our tokens. To do this, we map each character to an index as key-value pairs.

As a result, 38 characters are mapped: the 26 letters of the alphabet, the 10 digits, newline, and space.
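The windowing and the character-to-index mapping are easier to follow on a very short string. This is a small sketch with toy values (a 12-character text, a window of 5 and a step of 3); `make_windows` and the `toy_*` names are hypothetical helpers, not part of the notebook.

```python
def make_windows(text, maxlen, step):
    """Return (inputs, targets): each input is maxlen characters long and
    each target is the single character that follows it."""
    inputs, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        inputs.append(text[i: i + maxlen])
        targets.append(text[i + maxlen])
    return inputs, targets

toy_text = "frankenstein"  # 12 characters
inputs, targets = make_windows(toy_text, maxlen=5, step=3)
for x_seq, y_char in zip(inputs, targets):
    print(repr(x_seq), '->', repr(y_char))

# Map every distinct character to an integer index, in sorted order.
toy_chars = sorted(set(toy_text))
toy_char_indices = {char: index for index, char in enumerate(toy_chars)}
print(toy_char_indices)
```

With a step of 3 the window starts at positions 0, 3 and 6, so only three patterns are produced from this text.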


In [0]:
maxlen = 60
step = 3
sentences = []
next_chars = []
 
for i in range(0, len(text) - maxlen, step):
  sentences.append(text[i: i + maxlen])
  next_chars.append(text[i + maxlen])
 
print('Number of Sequences:', len(sentences))
 
chars = sorted(list(set(text)))
print('\nCharacters:', chars)
print('Unique Characters:', len(chars))
 
char_indices = {char: index for index, char in enumerate(chars)}
print('\nCharacter Dictionary:', char_indices)

Number of Sequences: 143120

Characters: ['\n', ' ', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Unique Characters: 38

Character Dictionary: {'\n': 0, ' ': 1, '0': 2, '1': 3, '2': 4, '3': 5, '4': 6, '5': 7, '6': 8, '7': 9, '8': 10, '9': 11, 'a': 12, 'b': 13, 'c': 14, 'd': 15, 'e': 16, 'f': 17, 'g': 18, 'h': 19, 'i': 20, 'j': 21, 'k': 22, 'l': 23, 'm': 24, 'n': 25, 'o': 26, 'p': 27, 'q': 28, 'r': 29, 's': 30, 't': 31, 'u': 32, 'v': 33, 'w': 34, 'x': 35, 'y': 36, 'z': 37}


# **Transforming Data for Model Building**
Because LSTM networks need tensors as input and target, we create a 3D tensor (NumPy array x of shape (sequences, maxlen, unique characters)) as the LSTM input and a 2D tensor (NumPy array y of shape (sequences, unique characters)) as the target. 

To do this, we one-hot encode the sentences we created before.

In [0]:
# Vectorization
x = np.zeros((len(sentences), maxlen, len(chars)), dtype = bool)
y = np.zeros((len(sentences), len(chars)), dtype = bool)
 
for i, sentence in enumerate(sentences):
  for t, char in enumerate(sentence):
    x[i, t, char_indices[char]] = 1
  y[i, char_indices[next_chars[i]]] = 1
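To see what this loop produces, it can help to run the same logic on a tiny hand-made dataset. The sketch below uses two hypothetical 2-character sequences over a 3-character alphabet; all `toy_*` names are assumptions made for illustration.

```python
import numpy as np

toy_sentences = ['ab', 'bc']   # hypothetical input sequences
toy_next_chars = ['c', 'a']    # hypothetical target characters
toy_chars = ['a', 'b', 'c']
toy_indices = {char: index for index, char in enumerate(toy_chars)}
toy_maxlen = 2

x_toy = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)

for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        # Exactly one True per time step: the position of that character.
        x_toy[i, t, toy_indices[char]] = True
    y_toy[i, toy_indices[toy_next_chars[i]]] = True

print(x_toy.shape, y_toy.shape)
print(x_toy[0].astype(int))
```

Every row of `x_toy[i]` and every `y_toy[i]` contains exactly one `True`, which is the defining property of a one-hot encoding.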

In [0]:
x[0]

array([[False, False, False, ..., False, False,  True],
       [ True, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [0]:
y[0]

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False])

# **Building an LSTM Model**
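Since we predict one of 38 unique characters, we use the softmax activation function on the last layer of the model and the categorical_crossentropy loss function during training.

return_sequences = True makes an LSTM layer return its output at every time step instead of only the last one. The next LSTM layer in the stack needs that full sequence as its input, so every LSTM layer except the last sets it.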
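The pairing of softmax with categorical cross-entropy can be sketched in plain NumPy. This is only an illustration of the two formulas on hypothetical scores for three classes, not how Keras implements them internally.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max keeps the exponentials numerically stable;
    # the result is non-negative and sums to 1.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(one_hot_target, probs):
    # Only the probability assigned to the true class contributes.
    return -np.sum(one_hot_target * np.log(probs))

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probs = softmax(logits)
target = np.array([1.0, 0.0, 0.0])   # the true class is index 0
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

The loss is simply the negative log of the probability the model gave to the correct character, so it shrinks toward 0 as that probability approaches 1.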

In [0]:
model = Sequential()
model.add(LSTM(64, input_shape = (maxlen, len(chars)), return_sequences = True))
model.add(LSTM(128, return_sequences = True))
model.add(LSTM(256))
model.add(Dense(len(chars), activation = 'softmax'))

In [0]:
model.compile(loss = 'categorical_crossentropy',
              optimizer = RMSprop(learning_rate = 0.01))

In [0]:
model.fit(x, y,
          batch_size = 512,
          epochs = 30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [0]:
model.save('first_model')

# **Training the Language Model and Sampling from it**
To generate text from the trained language model, we repeat the following steps:


1.   Draw from the model a probability distribution for the next character given the generated text available so far.
2.   Reweight the distribution to a certain temperature.
3.   Sample the next character at random according to the reweighted distribution.
4.   Add the new character at the end of the available text.

In sampling, the temperature controls the entropy of the next-character distribution. A low temperature sharpens the distribution, so the generated text is more predictable and repetitive; as the temperature approaches 0, this becomes greedy sampling, which always picks the most likely character. A high temperature flattens the distribution, so the generated text is more random and harder to predict. This is stochastic sampling.

To create new and surprising sequences, we can use a high temperature; to create more English-like sequences, we can use a low temperature.

Let's try several temperature values to investigate their effect on our character predictions.
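The effect of temperature can be checked numerically before touching the model. The sketch below applies the same log, divide, exponentiate and renormalize transformation as the notebook's `sample` function to a made-up four-character distribution; `reweight`, `entropy` and `toy_probs` are names assumed here for illustration.

```python
import numpy as np

def reweight(probs, temperature):
    # Divide the log-probabilities by the temperature, then renormalize.
    logits = np.log(np.asarray(probs, dtype='float64')) / temperature
    exp = np.exp(logits)
    return exp / exp.sum()

def entropy(probs):
    # Shannon entropy in nats; higher means a more random choice.
    return -np.sum(probs * np.log(probs))

toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-character distribution
entropies = []
for temperature in [0.2, 0.5, 1.0, 1.2]:
    p = reweight(toy_probs, temperature)
    entropies.append(entropy(p))
    print(temperature, np.round(p, 3), round(entropies[-1], 3))
```

At temperature 1.0 the distribution is unchanged; below 1.0 most of the probability mass moves to the most likely character, and above 1.0 it spreads out across the others.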





In [0]:
def sample(preds, temperature = 1.0):
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

In [0]:
def generate_text(model, sentence):
  # start_index = rd.randint(0, len(text) - maxlen - 1)
  # generated_text = text[start_index: start_index + maxlen]
  generated_text = sentence[0: maxlen]
  print('\n\t--- Generating with seed --> "' + generated_text + '"')
 
  for temperature in [0.2, 0.5, 1.0, 1.2]:
    print('\n\t--- Temperature -->', temperature)
    sys.stdout.write(generated_text)
 
    for i in range(200):
      sampled = np.zeros((1, maxlen, len(chars)))
      for t, char in enumerate(generated_text):
        sampled[0, t, char_indices[char]] = 1.
          
      preds = model.predict(sampled)[0]
      next_index = sample(preds, temperature)
      next_char = chars[next_index]
 
      generated_text += next_char
      generated_text = generated_text[1:]
 
      sys.stdout.write(next_char)
      sys.stdout.flush()
    print()

# **Generating Text**
When we generate text from the language model, we can clearly see that with temperature values of 0.2 and 0.5 the model produces mostly real English words, while with 1.0 and 1.2 it produces English-like words.

We should not expect meaningful context at this stage because the dataset is small and the number of training epochs is low. If we train the model with more data and more epochs, the predictions may become more meaningful.

In [0]:
generate_text(model, 'beware for i am fearless and therefore powerful i will watch')


	--- Generating with seed --> "beware for i am fearless and therefore powerful i will watch"

	--- Temperature --> 0.2
beware for i am fearless and therefore powerful i will watching a torment or the sensation of the hope of the despair the same an account of the most creator was a wretch i have not wished to the sun feeded she was to assured the sea by the secret he said the 

	--- Temperature --> 0.5
feeded she was to assured the sea by the secret he said the secret her also be among the secret of communious of marn and the family and the arabiant of
your enemy and have surprised i shall be even put the useries of my dear very despair  the envoxute sympath

	--- Temperature --> 1.0
ut the useries of my dear very despair  the envoxute sympathy
and streaming occur overcammed mr kindless nature when perhithed you
to his misery into me seemed to the most of splank
younger complying to explain  royage or restraited me all have you will commen

	--- Temperature --> 1.2
to explain  royage 

# **Inspecting Different Models to improve Text Generation**
Let's look at what we can do to generate more meaningful text. First, I want to wrap the LSTM layers in a Bidirectional layer to create a new model.

A bidirectional RNN exploits the order sensitivity of RNNs: it uses two regular RNNs, such as GRU or LSTM layers, each of which processes the input sequence in one direction, and then merges their representations. By processing a sequence both ways, a bidirectional RNN can catch patterns that may be overlooked by a unidirectional RNN.

For example:
Let's say we try to predict the next word in a sentence. At a high level, what a unidirectional LSTM sees is
* The boys went to ....

It tries to predict the next word from this context alone. A bidirectional LSTM can also see information further down the road, for example

Forward LSTM:
* The boys went to ...

Backward LSTM:
* ... and then they got out of the pool

You can see that with information from the future it can be easier for the network to work out what the next word is.

In theory, using the Bidirectional layer I can expect to predict better text. Note that in this notebook both directions read the same 60-character window, so the backward LSTM never sees characters that come after the one being predicted.
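The mechanics of a bidirectional layer can be sketched with a plain tanh RNN standing in for the LSTM. Everything below (the sizes, the random weights, the helper `simple_rnn_last_state`) is a hypothetical illustration of the idea, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_rnn_last_state(inputs, w_in, w_rec):
    # Plain tanh RNN (a stand-in for an LSTM): returns the final hidden state.
    state = np.zeros(w_rec.shape[0])
    for x_t in inputs:
        state = np.tanh(x_t @ w_in + state @ w_rec)
    return state

timesteps, features, units = 6, 4, 3  # hypothetical sizes
sequence = rng.normal(size=(timesteps, features))

# Each direction has its own, independently initialized weights.
fw_in, fw_rec = rng.normal(size=(features, units)), rng.normal(size=(units, units))
bw_in, bw_rec = rng.normal(size=(features, units)), rng.normal(size=(units, units))

forward = simple_rnn_last_state(sequence, fw_in, fw_rec)
# The backward direction reads the same window in reverse order.
backward = simple_rnn_last_state(sequence[::-1], bw_in, bw_rec)

# Concatenating the two states doubles the output width
# (concatenation is the default merge mode of Keras' Bidirectional wrapper).
merged = np.concatenate([forward, backward])
print(merged.shape)
```

Because the two directions are concatenated, a `Bidirectional(LSTM(256))` layer outputs 512 features and holds roughly twice the parameters of a single `LSTM(256)`.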


In [0]:
from keras.layers import Bidirectional
 
model = Sequential()
model.add(Bidirectional(LSTM(256, input_shape = (maxlen, len(chars)), return_sequences = True)))
model.add(Bidirectional(LSTM(256, return_sequences = True)))
model.add(Bidirectional(LSTM(512)))
model.add(Dense(len(chars), activation = 'softmax'))

In [0]:
model.compile(loss = 'categorical_crossentropy',
              optimizer = RMSprop(learning_rate = 0.01))

In [0]:
model.fit(x, y,
          batch_size = 512,
          epochs = 30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.callbacks.History at 0x7f31c5fd2048>

In [0]:
model.save('second_model')

In [0]:
generate_text(model, 'beware for i am fearless and therefore powerful i will watch')


	--- Generating with seed --> "beware for i am fearless and therefore powerful i will watch"

	--- Temperature --> 0.2
beware for i am fearless and therefore powerful i will watchfing the straw and the monster and i in the beings of the greatest warmine and the secoment of the greatest
aspect the fiend of the greatest
and the sea and i am my feelings and intervered me and i wa

	--- Temperature --> 0.5

and the sea and i am my feelings and intervered me and i was the lovely plan’

i was sooth you even it is
the truth of the
most destrain had been
to be a few heaven in the peet i am no among the coppanion  i had been visited in the house who was more and one 

	--- Temperature --> 1.0
anion  i had been visited in the house who was more and one of sight

and not man arguilling i did not be it gazed on the summit of domes on your gentle
also and supposite and in the more ceas me and

# **Generating Text**
The newly generated text is clearly worse than before for all temperature values. I think the reason is that training got stuck in a local minimum: the loss value stops decreasing much after about the 20th epoch. Let's try to solve this problem and generate better text.

# **Inspecting Different Parameters to improve Text Generation**
I change the optimizer from RMSprop to 'adam' to decrease the loss value as much as possible during training. The layer sizes also go back to those of the first model (64, 128 and 256 units).
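To make the optimizer switch more concrete, here is a minimal NumPy sketch of the RMSprop and Adam update rules on a toy one-parameter problem, f(w) = w². The function names and hyperparameter values are illustrative assumptions, and both optimizers use the same learning rate here so that only the update rule differs. One detail worth keeping in mind: passing the string 'adam' uses Keras' default learning rate of 0.001, so compared with RMSprop at 0.01 this change also lowers the learning rate tenfold.

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=0.01, rho=0.9, eps=1e-7):
    # Scale the gradient by a running average of its squared magnitude.
    v = rho * v + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(v) + eps), v

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7):
    # Adam also keeps a running average of the gradient itself (momentum)
    # and corrects both averages for starting at zero.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w_rms, v_rms = 1.0, 0.0
w_adam, m_adam, v_adam = 1.0, 0.0, 0.0
for t in range(1, 51):
    # The gradient of w**2 is 2 * w.
    w_rms, v_rms = rmsprop_step(w_rms, 2 * w_rms, v_rms)
    w_adam, m_adam, v_adam = adam_step(w_adam, 2 * w_adam, m_adam, v_adam, t)

print(w_rms, w_adam)
```

Both rules move the parameter toward the minimum at 0. On a real network, which one reaches a lower loss depends on the problem and the learning rate, so the comparison in this notebook is an empirical one.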

Let's examine whether we have made progress or not.

In [0]:
from keras.layers import Bidirectional
 
model = Sequential()
model.add(Bidirectional(LSTM(64, input_shape = (maxlen, len(chars)), return_sequences = True)))
model.add(Bidirectional(LSTM(128, return_sequences = True)))
model.add(Bidirectional(LSTM(256)))
model.add(Dense(len(chars), activation = 'softmax'))

In [0]:
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')

In [0]:
model.fit(x, y, batch_size = 512, epochs = 30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.callbacks.History at 0x7fcc0f2986d8>

In [0]:
model.save('third_model')

In [0]:
generate_text(model, 'beware for i am fearless and therefore powerful i will watch')


	--- Generating with seed --> "beware for i am fearless and therefore powerful i will watch"

	--- Temperature --> 0.2
beware for i am fearless and therefore powerful i will watching and one of the most friends which i had formed the most felix was paid the father the sun doel content to project gutenbergtm electronic works in the maintenance of the most friends when i reflect

	--- Temperature --> 0.5
 works in the maintenance of the most friends when i reflected the histreas of a sometimes i said the subst that i am hardly knees her miserable and a cursed by regater so on the project gutenbergtm
curiosity and far more my eyes that feelings as that she
foll

	--- Temperature --> 1.0
uriosity and far more my eyes that feelings as that she
followed yet one of the facts of donations that was on the horsen of the room weigh on the most carorsed by a thousand fear upon
joy state or the slass
hirstorined for there and that of her

chelgest caro

	--- Temperature --> 1.2
e slass
hirstorined

# **Generating Text**
We can clearly see that we made progress on the loss value and on character prediction compared with the preceding model. Again, for temperature values of 0.2 and 0.5, the model generated English words. For temperature values of 1.0 and 1.2, the model generated English-like, random words.

# **Conclusion and Future Works**
We can accept the first and third models as a first step toward generating "Frankenstein-like" text. 

To improve the writing skills of the model, we should train it with more text data, more layers and more epochs, for example over 100. 