# Text Generation Using an RNN
```In this exercise you will use a recurrent neural network architecture. Its main purpose is for you to gain confidence working with neural networks, while having fun with an interesting and simple application of them.```

```This exercise is based on a blog post, which you can find at``` https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

```~Ittai Haran```

In [0]:
# Load LSTM network and generate text
import sys
import numpy as np
from keras.models import Model
from keras.layers import Dense, Dropout, LSTM, Input, GRU
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


## Part I
```Generating text by generating letters.```

1) 
```Start by loading the text of Alice in Wonderland by Lewis Carroll. Cut away the header and transform the entire text into lower case. Finish when you have a lowercase string containing the story.```

In [0]:
# load the text and convert it to lowercase
filename = "data/wonderland.txt"
with open(filename, encoding='utf-8') as f:
    raw_text = f.read()
# cut away the header (the first 720 characters)
raw_text = raw_text[720:]
raw_text = raw_text.lower()

2) 
```Create a mapping between the unique characters in the text and integers. Create the reverse mapping.```

In [0]:
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  144339
Total Vocab:  45
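
As a quick sanity check of the two mappings, the following minimal sketch encodes a short string to integers and decodes it back. The string `toy_text` and the `toy_` names are made up for illustration and are not taken from the notebook's data.

```python
# Toy illustration of the character <-> integer mappings.
# toy_text is a made-up sample, not the book text.
toy_text = "alice was beginning"
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

# encode every character, then decode the integers back to text
encoded = [toy_char_to_int[c] for c in toy_text]
decoded = ''.join(toy_int_to_char[i] for i in encoded)

print(encoded[:5])
print(decoded == toy_text)
```

If the round trip does not reproduce the input, the two dictionaries are not inverses of each other.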


3) ```Create the dataset: your network will receive sequences of 20 characters (or, to be precise, the integers replacing those characters) and predict the next character. Save your results in dataX and dataY. Make sure you have integer vectors rather than vectors of characters. Transform each integer vector of dataX into a matrix of size (sequence length (20)) X (number of different letters) using 1-hot encoding. Do the same to dataY.```

In [0]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 20
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append(np.eye(n_vocab)[[char_to_int[char] for char in seq_in]])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  144319


In [0]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, n_vocab))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
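
To see the resulting shapes on something small enough to inspect by hand, here is a minimal sketch of the same sliding-window construction on a toy string. `toy_text`, `toy_seq_length` and the other `toy_` names are assumptions chosen for illustration; indexing `np.eye` with the targets stands in for the role `to_categorical` plays above.

```python
import numpy as np

# Toy version of the dataset construction; the text and window length are made up.
toy_text = "hello world"
toy_seq_length = 4
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_vocab = len(toy_chars)

windows, targets = [], []
for i in range(len(toy_text) - toy_seq_length):
    seq_in = toy_text[i:i + toy_seq_length]
    seq_out = toy_text[i + toy_seq_length]
    # indexing the identity matrix with a list of integers
    # yields one one-hot row per character
    windows.append(np.eye(toy_vocab)[[toy_char_to_int[c] for c in seq_in]])
    targets.append(toy_char_to_int[seq_out])

toy_X = np.array(windows)            # (samples, time steps, features)
toy_y = np.eye(toy_vocab)[targets]   # (samples, features)

print(toy_X.shape, toy_y.shape)
```

Every row of `toy_X[k]` and of `toy_y` contains exactly one 1, which is an easy property to check before training on the real data.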

4) ```Create a simple RNN model with one hidden LSTM layer of 256 units, followed by dropout with a rate of 0.2.```

In [0]:
# define the LSTM model
input_layer = Input((X.shape[1],X.shape[2]))
hidden_layer = LSTM(256, activation='tanh')(input_layer)
hidden_layer = Dropout(0.2)(hidden_layer)
output_layer = Dense(y.shape[1], activation='softmax')(hidden_layer)
model = Model(inputs = [input_layer], outputs = [output_layer])

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 20, 45)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               309248    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 45)                11565     
Total params: 320,813
Trainable params: 320,813
Non-trainable params: 0
_________________________________________________________________
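
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel and a bias vector); the helper names `lstm_params` and `dense_params` are hypothetical and not part of Keras.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel (input_dim x units),
    # a recurrent kernel (units x units) and a bias vector (units)
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus bias vector
    return input_dim * units + units

n_lstm = lstm_params(45, 256)     # 45 one-hot features into 256 units
n_dense = dense_params(256, 45)   # 256 hidden units into 45 characters
print(n_lstm, n_dense, n_lstm + n_dense)
```

The three numbers should match the `lstm_1`, `dense_1` and `Total params` entries of the summary.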


5) ```Train your model. Use a callback to save your model after every epoch.```

In [0]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=20, batch_size=500, callbacks=callbacks_list)

Epoch 1/20

Epoch 00001: loss improved from inf to 0.83338, saving model to weights-improvement-01-0.8334.hdf5
Epoch 2/20

Epoch 00002: loss improved from 0.83338 to 0.80720, saving model to weights-improvement-02-0.8072.hdf5
Epoch 3/20

Epoch 00003: loss improved from 0.80720 to 0.78393, saving model to weights-improvement-03-0.7839.hdf5
Epoch 4/20

Epoch 00004: loss improved from 0.78393 to 0.76053, saving model to weights-improvement-04-0.7605.hdf5
Epoch 5/20

Epoch 00005: loss improved from 0.76053 to 0.73810, saving model to weights-improvement-05-0.7381.hdf5
Epoch 6/20

Epoch 00006: loss improved from 0.73810 to 0.71517, saving model to weights-improvement-06-0.7152.hdf5
Epoch 7/20

KeyboardInterrupt: 
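
The file names in the log follow from the `filepath` template, which uses ordinary Python format fields that are filled in with the epoch number and the logged loss. A minimal sketch with plain `str.format`, using the values reported for the first epoch above:

```python
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# epoch is zero-padded to two digits, loss is rounded to four decimals
name = filepath.format(epoch=1, loss=0.83338)
print(name)
```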

In [0]:
from keras.losses import categorical_crossentropy, binary_crossentropy
import keras.backend as K

In [0]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20

Epoch 00001: loss improved from inf to 0.80306, saving model to weights-improvement-01-0.8031.hdf5
Epoch 2/20

Epoch 00002: loss improved from 0.80306 to 0.78025, saving model to weights-improvement-02-0.7802.hdf5
Epoch 3/20

Epoch 00003: loss improved from 0.78025 to 0.76139, saving model to weights-improvement-03-0.7614.hdf5
Epoch 4/20
  5504/144319 [>.............................] - ETA: 57s - loss: 0.6611

KeyboardInterrupt: 

6) ```Now we will use the model to generate text. Start with a random seed, that is, a random sequence you used when training the model. Do the following:```
- ```Predict the next letter.```
- ```Save the letter you got.```
- ```Append the predicted letter to the sequence (concatenate on the right).```
- ```Drop the leftmost letter of your sequence.```
- ```Repeat 1000 times.```
- ```Print the predicted sentences your model created :)```

In [0]:
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print("Seed:")
print("\"", ''.join([int_to_char[np.argmax(value)] for value in pattern]), "\"")
# generate characters
for i in range(1000):
    # add a batch dimension: (1, seq_length, number of characters)
    x = np.reshape(pattern, (1, pattern.shape[0], pattern.shape[1]))
    prediction = model.predict(x, verbose=0)
    # greedy choice: take the most probable next character
    index = np.argmax(prediction)
    result = int_to_char[index]
    sys.stdout.write(result)
    # append the one-hot row of the predicted character and drop the oldest row
    pattern = np.concatenate([pattern, np.eye(pattern.shape[1])[index:index+1]], axis=0)
    pattern = pattern[1:]
print("\n\n\nDone.")

Seed:
" like
that!’

‘i coul "
dn’t help it,’ said alice very politely, for she was
beginning to feet very too thing--there was a little down with on. ‘that’s the caterpillar took the hooker, and said no hing.

the lorsters to pear hery looked up and sharples up and the madchee of the sond in the ends. will you wouldn’t be stendy of the back, and then the queen was stoning for her fout her to go down the chimney, and fout hers little time the room of the surpless on it take the direace in the air. ho said here, she falled out of she know and the botton to sear the while said, and the march hare was her felt a right to little bit low, as she seord this very curious to be like the time, ther were notile of the court, and the mock turtle whither it was a lorg to the sand, and she had never seen such a chine, of down and her he said and hered in an inathat tome of the lestone, and book at the one, what i said i can’t be must of crishous of herens and the way of the bock, she foldousht this
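
The sliding-window mechanics of the generation loop can be tried without a trained network. In the sketch below, `toy_predict` is a hypothetical stand-in for `model.predict` that always votes for the next letter of a five-letter alphabet; only the window handling mirrors the loop above.

```python
import numpy as np

# toy_alphabet and toy_predict are made-up stand-ins, not part of the notebook.
toy_alphabet = "abcde"
n = len(toy_alphabet)

def toy_predict(x):
    # Pretend model: all probability on the letter following the last one in the window.
    last = np.argmax(x[0, -1])
    probs = np.zeros((1, n))
    probs[0, (last + 1) % n] = 1.0
    return probs

pattern = np.eye(n)[[0, 1, 2]]        # one-hot window for "abc"
generated = []
for _ in range(7):
    x = pattern[np.newaxis, ...]      # add the batch dimension
    index = int(np.argmax(toy_predict(x)))
    generated.append(toy_alphabet[index])
    # slide the window: append the new one-hot row, drop the oldest one
    pattern = np.concatenate([pattern, np.eye(n)[index:index + 1]], axis=0)[1:]

print(''.join(generated))
```

The window keeps a constant length throughout, so the input shape seen by the predictor never changes.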

7) ```What can you say about the generated text? Is it readable? Did you get any real English words? Any real English sentences?```
```Try adding another LSTM+Dropout layer to your model. Are the results any better?```

In [0]:
# define the LSTM model
input_layer = Input((X.shape[1],X.shape[2]))
# the first LSTM must return the full sequence so the second LSTM gets one vector per time step
hidden_layer = LSTM(256, activation='tanh', return_sequences=True)(input_layer)
hidden_layer = Dropout(0.2)(hidden_layer)
hidden_layer = LSTM(256, activation='tanh')(hidden_layer)
hidden_layer = Dropout(0.2)(hidden_layer)
output_layer = Dense(y.shape[1], activation='softmax')(hidden_layer)
model = Model(inputs = [input_layer], outputs = [output_layer])

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

## Part II
```Generating text by generating words using Word2Vec.```

8) ```Start by loading a word2vec model and a word tokenizer (using nltk).```

In [0]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

twt = RegexpTokenizer(r'\w+')

from gensim.models.keyedvectors import KeyedVectors
word2vec = KeyedVectors.load_word2vec_format('resources/GoogleNews-vectors-negative300.bin.gz',
                                             binary=True)

FileNotFoundError: ignored

9) ```Tokenize the text to get a list of the words of the story. Which words does your word2vec model not recognize? Try filtering out such words, or fixing other words, while keeping the impact on the original text minimal.```

In [0]:
stop_words = set(stopwords.words('english'))
tokenized_text = filter(lambda x: x not in stop_words, twt.tokenize(raw_text.replace('-', ' ').replace('\xe2', ' ')))

10) ```The book is written by a British author, but word2vec was trained on American spelling. Luckily, you are provided with a British-to-American dictionary to help you translate British spelling to American spelling. Use it to clean your text.```

In [0]:
import pickle

with open('resources/british_to_american.pkl', 'rb') as f:
    british_to_american = pickle.load(f)

In [0]:
tokenized_text = map(lambda x: british_to_american.get(x,x), tokenized_text)
tokenized_text = list(filter(lambda x: x in word2vec, tokenized_text))
tokenized_text_unique = list(set(tokenized_text))
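
The translate-then-filter step can be illustrated on a handful of tokens. The dictionary entries and the vocabulary set below are made-up stand-ins for the pickled dictionary and the word2vec vocabulary, so this is only a sketch of the pattern, not of the real data.

```python
# Toy stand-ins for british_to_american and for the word2vec vocabulary.
toy_british_to_american = {"colour": "color", "grey": "gray"}
toy_vocabulary = {"color", "gray", "rabbit", "cat"}

toy_tokens = ["grey", "rabbit", "colour", "jabberwock", "cat"]
# dict.get(word, word) returns the translation if there is one, else the word itself
translated = [toy_british_to_american.get(w, w) for w in toy_tokens]
# keep only the tokens the (toy) vocabulary knows
kept = [w for w in translated if w in toy_vocabulary]

print(kept)
```

Translating before filtering matters: a British spelling that is missing from the vocabulary would otherwise be dropped instead of mapped to a known word.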

11) ```Create the word_to_int and int_to_word dictionaries as you did earlier with the characters.```

In [0]:
int_to_word = dict(enumerate(tokenized_text_unique))
word_to_int = {v:k for k,v in int_to_word.items()}

n_words = len(tokenized_text)
n_vocab = len(tokenized_text_unique)

12) ```Create a dataset. This time we will not use 1-hot encoding, but an Embedding layer. Hence, each sample will consist of 10 integers between 0 and the size of your word_to_int dictionary. We would like our model to predict a probability distribution over all the words that appeared in our tokenized text. Build your target accordingly.```

In [0]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 10
dataX = []
dataY = []
for i in range(0, n_words - seq_length, 1):
    seq_in = tokenized_text[i:i + seq_length]
    seq_out = tokenized_text[i + seq_length]
    dataX.append([word_to_int[word] for word in seq_in])
    dataY.append(word_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  12157


13) ```Create a matrix of size (number of different words) X (dimension of the word2vec vectors), where the i-th row is the vector of int_to_word[i].```

In [0]:
X = np.array(dataX)
Y = np.eye(n_vocab)[dataY]
# row i holds the word2vec vector of int_to_word[i]
matrix = np.array([word2vec[int_to_word[i]] for i in range(n_vocab)])
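
An embedding layer initialized with such a matrix behaves like a row lookup: feeding it an integer index returns the corresponding row, which is the same result as multiplying a one-hot vector by the matrix. A minimal NumPy sketch of that equivalence, with made-up sizes and random values in place of the word2vec vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab, toy_dim = 5, 3                       # made-up sizes
toy_matrix = rng.normal(size=(toy_vocab, toy_dim))

sample = np.array([2, 0, 4])                    # a short sequence of word indices
looked_up = toy_matrix[sample]                  # direct row lookup
via_one_hot = np.eye(toy_vocab)[sample] @ toy_matrix

print(looked_up.shape)
print(np.allclose(looked_up, via_one_hot))
```

This is why the word-level dataset can stay as plain integers: the lookup replaces the explicit one-hot matrices used in Part I.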

14) ```Build the model. Use an Embedding layer and initialize it by specifying weights = [matrix] in its constructor. Besides that, use the same architecture you used earlier. Train your model. Try 2 different approaches: training the embedding layer, or freezing it.```

In [0]:
from keras.layers import Embedding

input_layer = Input((int(X.shape[1]),))
embedding = Embedding(input_dim=len(word_to_int), output_dim=300,
                      weights=[matrix], input_length=seq_length, trainable = False)(input_layer)
hidden_layer = LSTM(256, activation='tanh', return_sequences=True)(embedding)
hidden_layer = Dropout(0.2)(hidden_layer)
hidden_layer = LSTM(256, activation='tanh')(hidden_layer)
hidden_layer = Dropout(0.2)(hidden_layer)
output_layer = Dense(len(word_to_int), activation='softmax')(hidden_layer)
                     
model = Model(inputs = [input_layer], outputs = [output_layer])

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 10)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 10, 300)           716700    
_________________________________________________________________
lstm_2 (LSTM)                (None, 10, 256)           570368    
_________________________________________________________________
dropout_2 (Dropout)          (None, 10, 256)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2389)              613973    
Total para

In [0]:
model.fit(X, Y, verbose = 1, batch_size = 100, epochs = 50)

Epoch 1/50

KeyboardInterrupt: 

15) ```Time for predicting! Do as you did with the characters to generate text by generating words.```

In [0]:
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
# copy the seed so that appending below does not modify dataX
pattern = list(dataX[start])
print("Seed:")
print("\"", ' '.join([int_to_word[value] for value in pattern]), "\"")
# generate words
for i in range(1000):
    x = np.reshape(pattern, (1, len(pattern)))
    prediction = model.predict(x, verbose=0)
    # greedy choice: take the most probable next word
    index = np.argmax(prediction)
    result = int_to_word[index] + ' '
    sys.stdout.write(result)
    # append the predicted word index and drop the oldest one
    pattern.append(index)
    pattern = pattern[1:]
print("\nDone.")

Seed:
" head great curiosity friend mine cheshire cat said alice allow "
said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said said sa