# Text Generation Using an RNN
```In this exercise you will use a recurrent neural network architecture. Its main purpose is for you to gain confidence when working with networks, while having fun with an interesting and simple application of them.```

```This exercise is based on a blog post, which you can find at``` https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

```~Ittai Haran```

In [1]:
# Load LSTM network and generate text
import sys
import numpy as np
from keras.models import Model
from keras.layers import Dense, Dropout, LSTM, Input, GRU
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using Theano backend.
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Quadro K4200 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 6021)


## Part I
```Generating text by generating letters.```

```Start by loading the text of Alice in Wonderland by Lewis Carroll. Cut away the header and transform the entire text into lower case. You are done when you have a lowercased string containing the story.```

In [3]:
# load ascii text and convert to lowercase
filename = "data/wonderland.txt"
with open(filename) as f:
    raw_text = f.read()
raw_text = raw_text[720:]  # cut away the header
raw_text = raw_text.lower()

```Create a mapping between the unique characters in the text and integers. Create the reverse mapping.```

In [3]:
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  150317
Total Vocab:  47
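To check that the two mappings really are inverses of each other, a round trip on a short string is enough. The sketch below uses a toy string and `toy_`-prefixed names (both are assumptions made for illustration, so that the notebook's own variables are not overwritten).

```python
# Toy text standing in for the lowercased story (an assumption for illustration).
toy_text = "alice was beginning"

# Same construction as in the cell above: sorted unique characters, then two lookup tables.
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integers and decode it back.
encoded = [toy_char_to_int[c] for c in toy_text]
decoded = "".join(toy_int_to_char[i] for i in encoded)

print(len(toy_chars), "unique characters")
print("round trip ok:", decoded == toy_text)
```

Because the characters are sorted before being enumerated, the mapping is the same on every run, which matters when you later reload saved weights.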


```Create the dataset: your network will receive vectors of 20 characters (or, to be precise, the integers replacing those characters) and predict the next character. Save your results in dataX and dataY. Make sure you have integer vectors rather than vectors of characters. Transform the integer vectors of dataX into matrices of size (sequence length (20)) x (number of different letters) using 1-hot encoding. Do the same for dataY.```

In [4]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 20
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append(np.eye(n_vocab)[[char_to_int[char] for char in seq_in]])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  150297
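The pattern count above is simply `n_chars - seq_length` (150317 - 20 = 150297), since every window needs one following character as its target. The sliding window and the one-hot trick are easier to inspect on a tiny example; the sketch below assumes a toy string and a window of 4 instead of 20.

```python
import numpy as np

toy_text = "hello world"      # toy input, an assumption for illustration
toy_seq_length = 4            # the notebook uses 20

toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_vocab = len(toy_chars)

windows, targets = [], []
for i in range(len(toy_text) - toy_seq_length):
    seq_in = toy_text[i:i + toy_seq_length]
    seq_out = toy_text[i + toy_seq_length]
    # Indexing the identity matrix with a list of integers
    # yields one one-hot row per character.
    windows.append(np.eye(toy_vocab)[[toy_char_to_int[c] for c in seq_in]])
    targets.append(toy_char_to_int[seq_out])

toy_X = np.array(windows)            # (samples, time steps, features)
toy_y = np.eye(toy_vocab)[targets]   # (samples, features)

print(toy_X.shape, toy_y.shape)
```

Each row of each window contains exactly one 1, so taking `argmax` along the last axis recovers the integer sequence.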


In [5]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, n_vocab))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

```Create a simple RNN model with one hidden LSTM layer with 256 units, followed by dropout with a rate of 0.2.```

In [18]:
# define the LSTM model
input_layer = Input((X.shape[1],X.shape[2]))
hidden_layer = LSTM(256, activation='tanh')(input_layer)
hidden_layer = Dropout(0.2)(hidden_layer)
output_layer = Dense(y.shape[1], activation='softmax')(hidden_layer)
model = Model(inputs = [input_layer], outputs = [output_layer])

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 20L, 61L)          0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               325632    
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 61L)               15677     
Total params: 341,309
Trainable params: 341,309
Non-trainable params: 0
_________________________________________________________________
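The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on what the layers contain. The sketch below uses the standard LSTM and dense-layer formulas; the helper names are made up for this example. Note that the summary reports 61 features rather than the 47 printed earlier, presumably because it was recorded from a different run of the notebook.

```python
def lstm_params(input_dim, units):
    # An LSTM has 4 gates; each has an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit.
    return input_dim * units + units

n_features = 61  # feature dimension shown in the summary above
lstm_count = lstm_params(n_features, 256)
dense_count = dense_params(256, n_features)

print(lstm_count, dense_count, lstm_count + dense_count)
```

The three numbers agree with the `lstm_3`, `dense_3` and total rows of the summary.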


```Train your model. Use a callback to save your model after every epoch.```

In [None]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)
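The `filepath` argument is an ordinary Python format string that Keras fills in with the epoch number and the logged metrics. Also note that with `save_best_only=True` a file is written only when the monitored loss improves, not after every epoch. The sketch below imitates that behavior in plain Python with made-up loss values; it is an illustration, not Keras's own code.

```python
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# Hypothetical values, just to show how the template expands.
example_name = filepath.format(epoch=3, loss=1.25)
print(example_name)

# Imitation of save_best_only=True with mode='min' on made-up per-epoch losses.
losses = [2.9, 2.5, 2.6, 2.3]
best = float("inf")
saved = []
for epoch, loss in enumerate(losses, start=1):
    if loss < best:  # lower loss is better
        best = loss
        saved.append(filepath.format(epoch=epoch, loss=loss))

print(saved)  # epoch 3 is skipped because its loss did not improve
```

To really save after every epoch, set `save_best_only=False`.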

```Now we will use the model to generate text. Start with a random seed, that is, a random sequence you used when training the model. Do the following:```
- ```Predict the next letter.```
- ```Save the letter you got.```
- ```Append the predicted letter to the sequence (concatenate on the right).```
- ```Drop the leftmost letter of your sequence.```
- ```Repeat 1000 times.```
- ```Print the predicted sentences your model created :)```

In [None]:
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = dataX[start]  # one-hot matrix of shape (seq_length, n_vocab)
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern.argmax(axis=1)]), "\"")
# generate characters
for i in range(1000):
    # add the batch dimension: (1, seq_length, n_vocab), as used in training
    x = np.reshape(pattern, (1, seq_length, n_vocab))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    sys.stdout.write(result)
    # append the predicted character (one-hot) and drop the leftmost one
    pattern = np.vstack([pattern[1:], np.eye(n_vocab)[index]])
print("\nDone.")
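The sliding-window mechanics of the generation steps can be tried without a trained network. In the sketch below a simple lookup table stands in for the model (an assumption made purely for illustration): for each window of characters seen in a toy text, it returns the most frequent next character, playing the role of `np.argmax(model.predict(...))`.

```python
from collections import Counter, defaultdict

toy_text = "the cat sat on the mat. the cat sat on the mat. "
window = 3  # the notebook uses 20

# Stand-in "model": count which character follows each window in the toy text.
counts = defaultdict(Counter)
for i in range(len(toy_text) - window):
    counts[toy_text[i:i + window]][toy_text[i + window]] += 1

def predict_next(pattern):
    # Greedy choice, like taking the argmax of the network's output.
    return counts[pattern].most_common(1)[0][0]

pattern = toy_text[:window]   # the seed: a sequence seen during "training"
generated = []
for _ in range(20):
    if pattern not in counts:              # unseen window: no prediction available
        break
    next_char = predict_next(pattern)      # predict the next letter
    generated.append(next_char)            # save it
    pattern = pattern[1:] + next_char      # append on the right, drop the leftmost

print("".join(generated))
```

Greedy selection like this tends to fall into repeating loops, which is worth keeping in mind when you judge the text your network produces.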

```What can you say about the generated text? Is it readable? Did you get any real English words? Any real English sentences?```
```Try adding another LSTM+Dropout layer to your model. Are the results any better?```

In [None]:
# define the LSTM model
input_layer = Input((X.shape[1],X.shape[2]))
hidden_layer = LSTM(256, activation='tanh', return_sequences=True)(input_layer)
hidden_layer = Dropout(0.2)(hidden_layer)
hidden_layer = LSTM(256, activation='tanh')(hidden_layer)
hidden_layer = Dropout(0.2)(hidden_layer)
output_layer = Dense(y.shape[1], activation='softmax')(hidden_layer)
model = Model(inputs = [input_layer], outputs = [output_layer])

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
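When stacking recurrent layers, the first one has to hand a whole sequence of hidden states to the second, which is what `return_sequences=True` does. The sketch below shows the difference in output shapes using a plain tanh RNN written in NumPy. It is a simplified stand-in for an LSTM layer, and the function name and random weights are assumptions made for illustration.

```python
import numpy as np

def simple_rnn(x, units, return_sequences, seed=0):
    """Plain tanh RNN over x of shape (time steps, features)."""
    rng = np.random.RandomState(seed)
    w_in = rng.normal(scale=0.1, size=(x.shape[1], units))
    w_rec = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros(units)
    states = []
    for x_t in x:
        h = np.tanh(x_t @ w_in + h @ w_rec)
        states.append(h)
    # Either every hidden state, or only the last one.
    return np.array(states) if return_sequences else h

x = np.eye(5)[[0, 3, 1, 4]]  # 4 time steps, 5 one-hot features
seq = simple_rnn(x, units=8, return_sequences=True)
last = simple_rnn(x, units=8, return_sequences=False)
print(seq.shape, last.shape)

# A second recurrent layer needs a sequence as input, so it consumes `seq`.
stacked = simple_rnn(seq, units=8, return_sequences=False, seed=1)
print(stacked.shape)
```

The last row of `seq` equals `last`: the two settings differ only in how much of the state history is returned.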

## Part II
```Generating text by generating words using Word2Vec.```

```Start by loading a word2vec model and a word tokenizer (using nltk).```

In [6]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from gensim.models.keyedvectors import KeyedVectors

twt = RegexpTokenizer(r'\w+')
word2vec = KeyedVectors.load_word2vec_format('resources/GoogleNews-vectors-negative300.bin.gz', binary=True)



```Tokenize the text to get a list of the words of the story. Which words does your word2vec model not recognize? Try filtering out such words, or fixing other words, while keeping the impact on the original text minimal.```

In [7]:
stop_words = set(stopwords.words('english'))
# build a list (not a lazy filter object) so it can be indexed and reused later
tokenized_text = [w for w in twt.tokenize(raw_text.replace('-', ' ').replace('\xe2', ' ')) if w not in stop_words]
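With a plain pattern such as `r'\w+'`, `RegexpTokenizer` should behave like `re.findall` with the same pattern, so the tokenization and filtering step can be tried with the standard library alone. The sketch below uses a tiny, hypothetical stop-word set (nltk's English list is much longer).

```python
import re

sample = "alice was beginning to get very tired of sitting by her sister on the bank"

# Hypothetical, tiny stop-word set for illustration only.
toy_stop_words = {"was", "to", "of", "by", "her", "on", "the", "very"}

# Split on runs of word characters, after turning hyphens into spaces.
tokens = re.findall(r"\w+", sample.replace("-", " "))
kept = [w for w in tokens if w not in toy_stop_words]

print(kept)
```

Removing stop words shortens the text considerably, so the generated output will read more like a list of content words than like full sentences.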

```The book is written by a British author, but word2vec was trained on American spelling. Luckily, ```http://www.tysto.com/uk-us-spelling-list.html ```contains a list of word pairs, mapping British spelling to American spelling.
We will use this list as a side quest: reading and parsing the page with the BeautifulSoup module. Read about it and use Google to figure out how to parse the page. Create a dictionary that translates British spelling to American spelling and use it to clean your text.```

In [1]:
import requests
from bs4 import BeautifulSoup

req = requests.get("http://www.tysto.com/uk-us-spelling-list.html")
soup = BeautifulSoup(req.content, "html.parser")

table = soup.find_all('table')[1].contents[3].contents
british = table[1].text
british = british.split()
american = table[3].text
american = american.split()
british_to_american = dict(zip(british, american))

In [9]:
tokenized_text = [british_to_american.get(x,x) for x in tokenized_text]
tokenized_text = [x for x in tokenized_text if x in word2vec]
tokenized_text_unique = list(set(tokenized_text))
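The `dict.get(x, x)` idiom translates a word when it has an entry and leaves it untouched otherwise. The sketch below shows both cleaning steps on toy data; the two-entry spelling table and the small vocabulary set are hypothetical stand-ins for the scraped table and the word2vec model.

```python
# Hypothetical two-entry subset of the spelling table, for illustration only.
toy_british_to_american = {"colour": "color", "realise": "realize"}
# Hypothetical stand-in for the word2vec vocabulary.
toy_vocabulary = {"color", "realize", "rabbit", "white"}

tokens = ["white", "rabbit", "colour", "realise", "jabberwock"]

# Translate where an entry exists, otherwise keep the word as it is.
translated = [toy_british_to_american.get(w, w) for w in tokens]
# Keep only the words the (toy) vocabulary knows, and note what was lost.
known = [w for w in translated if w in toy_vocabulary]
dropped = [w for w in translated if w not in toy_vocabulary]

print(known)
print(dropped)
```

Inspecting the dropped words on the real text is a quick way to see what else might be worth fixing before training.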

```Create the word_to_int and int_to_word dictionaries as you did earlier with the characters.```

In [10]:
int_to_word = dict(enumerate(tokenized_text_unique))
word_to_int = {v:k for k,v in int_to_word.items()}

n_words = len(tokenized_text)
n_vocab = len(tokenized_text_unique)

```Create a dataset. This time we will not use 1-hot encoding but an Embedding layer. Hence, each sample will consist of 10 integers, each at least 0 and smaller than the size of your word_to_int dictionary. We would like our model to predict a probability distribution over all the words that appear in our tokenized text. Build your target accordingly.```

In [11]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 10
dataX = []
dataY = []
for i in range(0, n_words - seq_length, 1):
    seq_in = tokenized_text[i:i + seq_length]
    seq_out = tokenized_text[i + seq_length]
    dataX.append([word_to_int[word] for word in seq_in])
    dataY.append(word_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  12158


```Create a matrix of size (number of different words) x (dimension of word2vec vectors), where the i-th row is the vector of int_to_word[i].```

In [12]:
X = np.array(dataX)
Y = np.eye(n_vocab)[dataY]
# build the rows in index order, so that row i is the vector of int_to_word[i]
matrix = np.array([word2vec[int_to_word[i]] for i in range(n_vocab)])
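What the embedding layer does with this matrix is row lookup: each integer in a sample selects the corresponding row. The sketch below illustrates this with hypothetical 3-dimensional vectors standing in for the 300-dimensional word2vec ones.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, for illustration only.
toy_vectors = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "rabbit": np.array([0.0, 1.0, 0.0]),
    "queen": np.array([0.0, 0.0, 1.0]),
}
toy_int_to_word = dict(enumerate(sorted(toy_vectors)))

# Build rows in index order so that row i belongs to toy_int_to_word[i].
toy_matrix = np.array(
    [toy_vectors[toy_int_to_word[i]] for i in range(len(toy_int_to_word))]
)

# An embedding lookup is row indexing: word indices in, word vectors out.
sequence = [2, 0, 0, 1]
embedded = toy_matrix[sequence]

print(toy_int_to_word)
print(embedded.shape)
```

If the rows were in a different order than the indices, the network would silently receive the wrong vector for every word, which is why the index order is worth checking.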

```Build the model. Use an embedding layer and initialize it by specifying weights = [matrix] in its constructor. Apart from that, use the same architecture you used earlier. Train your model. Try two different approaches: training the embedding layer, or freezing it.```

In [13]:
from keras.layers import Embedding

input_layer = Input((int(X.shape[1]),))
embedding = Embedding(input_dim=len(word_to_int), output_dim=300,
                      weights=[matrix], input_length=seq_length, trainable = False)(input_layer)
hidden_layer = LSTM(256, activation='tanh', return_sequences=True)(embedding)
hidden_layer = Dropout(0.2)(hidden_layer)
hidden_layer = LSTM(256, activation='tanh')(hidden_layer)
hidden_layer = Dropout(0.2)(hidden_layer)
output_layer = Dense(len(word_to_int), activation='softmax')(hidden_layer)
                     
model = Model(inputs = [input_layer], outputs = [output_layer])

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 10)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 10, 300)           716700    
_________________________________________________________________
lstm_1 (LSTM)                (None, 10, 256)           570368    
_________________________________________________________________
dropout_1 (Dropout)          (None, 10, 256)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2389)              613973    
Total para

In [None]:
model.fit(X, Y, verbose = 1, batch_size = 100, epochs = 50)

```Time for predicting! Do as you did with the characters to generate text by generating words.```

In [None]:
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = list(dataX[start])  # copy, so that dataX itself is not modified
print("Seed:")
print("\"", ' '.join([int_to_word[value] for value in pattern]), "\"")
# generate words
for i in range(1000):
    # the Embedding layer expects integer word indices of shape (1, seq_length)
    x = np.reshape(pattern, (1, len(pattern)))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_word[index]
    sys.stdout.write(result + ' ')
    # append the predicted word and drop the leftmost one
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")

### Reading tasks
```Read about the following:```
- ```Transformer neural networks```
- ```Self attention mechanism```
- ```BERT```

```Discuss these concepts with your tutor.```