# Text Generation Using an RNN
```In this exercise you will use a recurrent neural network architecture. Its main purpose is for you to gain confidence when working with networks, while having fun with an interesting and simple application of them.```


```~Ittai Haran.```

```comments by: Roy Amir```

In [0]:
import sys
import numpy as np
from keras.models import Model
from keras.layers import Dense, Dropout, LSTM, Input, GRU
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


## Part I
```Generating text by generating letters.```

```1) Start by loading the text of Alice in Wonderland by Lewis Carroll. Cut away the header and transform the entire text into lower case. You are done when you have a lower-cased string containing the story. (The header ends at around character 700. Check the exact spot.)```

```2) Create a mapping between the unique characters in the text and integers. Create the reverse mapping.```
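
The mapping in step 2 can be sketched as follows. This is a minimal illustration, not the exercise's reference solution: the short `text` string is a stand-in for the lower-cased story, and the names `char_to_int` and `int_to_char` are assumptions chosen for this sketch.

```python
# Stand-in text; in the exercise this would be the lower-cased story.
text = "alice was beginning to get very tired"

# Sort the unique characters so the mapping is the same on every run.
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for c, i in char_to_int.items()}

# Round trip: characters -> integers -> characters.
encoded = [char_to_int[c] for c in text]
decoded = "".join(int_to_char[i] for i in encoded)
print(len(chars), "unique characters")
```

Sorting before enumerating is a design choice: iterating over a plain `set` of strings can give a different order between Python sessions, which would make a saved model's output indices meaningless after a restart.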

```3) Create the dataset: your network will receive vectors of 20 characters (or, to be precise, the integers replacing those characters) and predict the next character. Save your results in dataX and dataY.```

```(The idea is to slide a "window" with a length of 21 characters over the story's string, each time saving the first 20 characters as x and the 21st character as y.)```

```Make sure you have integer vectors rather than vectors of characters. Transform each integer vector of dataX into a matrix of size (sequence length (20)) X (number of different characters) using 1-hot encoding. Do the same for dataY, where each target becomes a single 1-hot vector.```
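
A possible way to build the windows and the 1-hot encoding with NumPy is sketched below. The sample `text` is a stand-in for the story, and `seq_length`, `x_int` and `y_int` are names introduced only for this illustration.

```python
import numpy as np

# Stand-in text; in the exercise this would be the lower-cased story.
text = "down the rabbit-hole. alice was beginning to get very tired"
seq_length = 20

chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
n_chars = len(chars)

encoded = np.array([char_to_int[c] for c in text])

# Each window holds seq_length + 1 integers: 20 inputs followed by 1 target.
n_samples = len(encoded) - seq_length
windows = np.stack([encoded[i:i + seq_length + 1] for i in range(n_samples)])
x_int, y_int = windows[:, :-1], windows[:, -1]

# Indexing an identity matrix with integer arrays yields 1-hot rows.
eye = np.eye(n_chars, dtype=np.float32)
dataX = eye[x_int]  # shape: (n_samples, seq_length, n_chars)
dataY = eye[y_int]  # shape: (n_samples, n_chars)

print(dataX.shape, dataY.shape)
```

Note that for the full story this dense 1-hot array can become large, so a smaller dtype or batch-wise encoding may be worth considering.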

```4) Create a simple RNN model with one hidden LSTM layer with 256 units and dropout with a rate of 0.2. Use categorical crossentropy as the loss.```
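
One way the model in step 4 might look with the Keras functional API is sketched here. The softmax output layer and the `adam` optimizer are assumptions (the exercise does not specify them), and `n_chars = 45` is a placeholder for the number of unique characters in your text.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dropout, Dense

seq_length = 20
n_chars = 45  # placeholder: use the size of your character mapping

inputs = Input(shape=(seq_length, n_chars))   # 1-hot encoded windows
x = LSTM(256)(inputs)                         # single hidden LSTM layer
x = Dropout(0.2)(x)
# Assumed output layer: one probability per character.
outputs = Dense(n_chars, activation="softmax")(x)

model = Model(inputs, outputs)
# The optimizer is an assumption; the loss is the one named in the exercise.
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```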

```5) Train your model. Use a callback to save your model after every epoch.```

```6) Now we will use the model to generate text. Start with a random seed, that is, a random sequence that you used when training the model. Do the following:```
- ```Predict the next letter.```
- ```Save the letter you got.```
- ```Add the predicted letter to the seed sequence (concatenate from the right).```
- ```Drop the leftmost letter in your sequence.```
- ```Repeat 1000 times.```
- ```Print the predicted text your model created :)```
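
The loop described in the steps above can be sketched without a trained model. Here `predict_proba` is a hypothetical stand-in that returns a uniform distribution; with the real network you would 1-hot encode the window and call `model.predict` instead. Sampling from the predicted distribution is one option assumed here, and taking the `argmax` is the greedy alternative.

```python
import numpy as np

rng = np.random.default_rng(0)
chars = sorted(set("the quick brown fox jumps over the lazy dog"))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for c, i in char_to_int.items()}
n_chars, seq_length = len(chars), 20


def predict_proba(window):
    """Hypothetical stand-in for the trained model.

    Returns a uniform distribution over characters, ignoring the window.
    """
    return np.full(n_chars, 1.0 / n_chars)


# Seed: 20 integer-encoded characters (random here; in the exercise,
# a sequence taken from the training data).
window = [int(i) for i in rng.integers(0, n_chars, size=seq_length)]
generated = []
for _ in range(1000):
    probs = predict_proba(window)
    next_int = int(rng.choice(n_chars, p=probs))  # sample the next character
    generated.append(int_to_char[next_int])       # save the letter
    window = window[1:] + [next_int]              # drop leftmost, append right

print("".join(generated))
```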

```7) What can you say about the generated text? Is it readable? Did you get any real English words? Any real English sentences?```
```Try adding another LSTM+Dropout layer to your model. Are the results any better?```

```(Remember: return_sequences = True if and only if the LSTM layer comes before another LSTM layer.)```
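
A sketch of the stacked variant follows, to show where `return_sequences=True` goes. As before, the softmax output layer, the `adam` optimizer and `n_chars = 45` are assumptions made for illustration.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dropout, Dense

seq_length = 20
n_chars = 45  # placeholder: use the size of your character mapping

inputs = Input(shape=(seq_length, n_chars))
# First LSTM feeds another LSTM, so it must emit one vector per time step.
x = LSTM(256, return_sequences=True)(inputs)
x = Dropout(0.2)(x)
# Last LSTM feeds a Dense layer, so only its final output is needed.
x = LSTM(256)(x)
x = Dropout(0.2)(x)
outputs = Dense(n_chars, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```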

## Part II
```Generating text by generating words using Word2Vec.```


```8) Start by loading a word2vec model and a word tokenizer (using nltk).```

```(you are supposed to use the huge file we once told you to download: "GoogleNews-vectors-negative300.bin" and use gensim.models.KeyedVectors.load_word2vec_format to load the model from it. word_tokenize from nltk is a good tokenizer)```

```9) Tokenize the text to get a list of the words of the story. Which words does your word2vec model not recognize? Try filtering out such words, or fixing them, while keeping the impact on the original text minimal.```

```(Do not blindly delete all the words that don't exist in the model!! Focus on **technical** reasons causing misrecognition, and fix them.)```
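
Before fixing anything, it can help to count which tokens are missing and how often. The sketch below uses hypothetical stand-ins: `tokens` would come from the tokenizer and `vocab` from the loaded word2vec model's vocabulary.

```python
from collections import Counter

# Hypothetical stand-ins for the tokenized story and the model vocabulary.
tokens = ["alice", "was", "beginning", "to", "get", "very", "tired",
          "--", "alice", "'s", "sister", "--"]
vocab = {"alice", "was", "beginning", "get", "very", "tired", "sister"}

# Count every token the vocabulary does not contain.
unknown = Counter(t for t in tokens if t not in vocab)

# Most frequent offenders first: these usually point at systematic causes.
for token, count in unknown.most_common():
    print(f"{token!r}: {count}")
```

Sorting by frequency is a design choice: a handful of very frequent unknown tokens often share one technical cause, which is cheaper to fix than handling each rare word individually.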


```10) What about the other unrecognised words? Some words are misrecognised because of language differences:```

```The book was written by a British author, but word2vec was trained on American-style text. Luckily, you are provided with a British-to-American dictionary to help you translate the British spelling to the American one. Use it to clean your text.```

```(Use the file that was attached to this exercise, "british_to_american.pkl", and the pickle module. After the dictionary check, you may delete any remaining unrecognised words.)```
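
The translation step might look like the sketch below. It assumes the pickle file holds a plain `dict` mapping British spellings to American ones; that structure, the sample entries, and the `vocab` set are all stand-ins for illustration. The dictionary is round-tripped through `pickle.dumps`/`pickle.loads` so the sketch runs without the attached file.

```python
import pickle

# Assumed structure of british_to_american.pkl: a dict of spellings.
sample = {"colour": "color", "grey": "gray"}
payload = pickle.dumps(sample)

# In the exercise you would instead do:
#   with open("british_to_american.pkl", "rb") as f:
#       british_to_american = pickle.load(f)
british_to_american = pickle.loads(payload)

tokens = ["the", "grey", "rabbit", "changed", "colour", "curiouser"]
vocab = {"the", "gray", "rabbit", "changed", "color"}  # stand-in vocabulary

# Translate where a mapping exists, keep the token unchanged otherwise.
translated = [british_to_american.get(t, t) for t in tokens]
# Only after translating, drop whatever is still unrecognised.
cleaned = [t for t in translated if t in vocab]

print(cleaned)
```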

---



```11) Create the word_to_int and int_to_word dictionaries as you did earlier with the characters.```

```12) Create a dataset. This time we will not use a 1-hot encoding for the input, but an Embedding layer. Hence, each sample consists of 10 integers between 0 and the size of your word_to_int dictionary minus one. We would like our model to predict a probability distribution over all the words that appeared in our tokenized text. Build your target accordingly.```

```(Use the same sliding window as before, this time with a window 10 units long, translating every word to its corresponding number. The target (the 11th word) should be 1-hot encoded.)```
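
A sketch of the word-level dataset, assuming a short stand-in token list in place of the cleaned story. The inputs stay as integers for the Embedding layer, and only the target is 1-hot encoded.

```python
import numpy as np

# Stand-in for the cleaned, tokenized story.
tokens = ("the rabbit took a watch out of its pocket and looked at it "
          "and then hurried on").split()
seq_length = 10

vocab = sorted(set(tokens))
word_to_int = {w: i for i, w in enumerate(vocab)}
encoded = np.array([word_to_int[w] for w in tokens])

# Inputs: integer windows of 10 words each.
n_samples = len(encoded) - seq_length
X = np.stack([encoded[i:i + seq_length] for i in range(n_samples)])

# Targets: the word following each window, 1-hot encoded.
y_int = encoded[seq_length:]
Y = np.eye(len(vocab), dtype=np.float32)[y_int]

print(X.shape, Y.shape)
```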

```13) Create a matrix of size (number of different words) X (dimension of word2vec vectors), where the i-th row is the vector of int_to_word[i].```

```(That is, the word's Word2Vec vector. np.zeros can be handy for creating an empty matrix of the desired size.)```
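
Filling the matrix can be sketched as follows. The `word_vectors` dict of random vectors is a hypothetical stand-in for the loaded word2vec model, and `embedding_dim = 4` replaces the 300 dimensions of the GoogleNews vectors to keep the example small.

```python
import numpy as np

embedding_dim = 4  # 300 for the GoogleNews vectors
rng = np.random.default_rng(0)

int_to_word = {0: "alice", 1: "rabbit", 2: "queen"}
# Hypothetical stand-in for the word2vec model: word -> vector lookup.
word_vectors = {w: rng.normal(size=embedding_dim) for w in int_to_word.values()}

# Row i holds the vector of int_to_word[i].
matrix = np.zeros((len(int_to_word), embedding_dim), dtype=np.float32)
for i, word in int_to_word.items():
    matrix[i] = word_vectors[word]

print(matrix.shape)
```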

```14) Build the model. Use an Embedding layer and initialize it by specifying weights = [matrix] in its constructor. Besides that, use the same architecture you used earlier. Train your model. Try 2 different approaches: training the embedding layer, or freezing it.```

```(Embedding layer: input_dim = size of the vocabulary, output_dim = size of the embedding vectors. To "freeze" a layer, set layer.trainable = False; set it to True to train it.)```
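
A sketch of the word-level model with a frozen, pre-initialized embedding. Note a substitution: the `weights=[matrix]` argument mentioned above may not be accepted by every Keras version, so this sketch uses `embeddings_initializer=Constant(matrix)` instead. The random `matrix`, the small sizes, the softmax output and the `adam` optimizer are all assumptions made for illustration.

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dropout, Dense
from keras.initializers import Constant

vocab_size, embedding_dim, seq_length = 50, 8, 10  # placeholder sizes
# Stand-in for the word2vec matrix built in step 13.
matrix = np.random.default_rng(0).normal(
    size=(vocab_size, embedding_dim)).astype("float32")

inputs = Input(shape=(seq_length,))  # integer word indices
embedding = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=Constant(matrix),  # substitute for weights=[matrix]
    trainable=False,                          # True to fine-tune the vectors
)
x = embedding(inputs)
x = LSTM(256)(x)
x = Dropout(0.2)(x)
outputs = Dense(vocab_size, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.summary()
```

Comparing the two approaches then only requires flipping `trainable` and rebuilding the model before training again.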

```15) Time for predicting! Do as you did with the characters to generate text by generating words.```

```(It takes some epochs to get results. Achieving over 90% accuracy is possible, and recommended. Let your computer run while you do other stuff.)```