**Homework 18**

In this assignment you will train an RNN to predict characters of *Alice in Wonderland* from strings of consecutive characters.

We begin as usual with the imports you will need for this assignment.

In [None]:
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import InputLayer
from tensorflow.keras.layers import Softmax

from tensorflow.keras.layers import LSTM

Run the following code block to read *Alice in Wonderland* from the web, convert it to lower case, remove punctuation and newline characters, and store the result in the variable `text` as a list of characters.

In [None]:
import string
from urllib.request import urlopen
url='https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt'
text = urlopen(url).read().decode('utf-8')
text=text.lower()
text=[c for c in text if (c not in string.punctuation) and (c!='\n')]

Write a class `Tokenizer` with the following methods:


*   `__init__`, a method that builds a dictionary `tokens` whose keys are the set of unique characters in some input `text`, and values are integers.
*   `encode`, a method that takes in a corpus of text, converts each character according to the dictionary built by the `__init__` method, and outputs a list of those integers.
*   `decode`, a method that takes a single integer (a value from the dictionary), and returns the corresponding character key.



In [None]:
class Tokenizer():
  def __init__(self,text):
    #Build a dictionary of tokens
    self.tokens = {}
    for c in set(text):
      self.tokens[c] = len(self.tokens)

  def encode(self,text):
    #Encode text using token dictionary, outputs list of those integers
    encoded_text = [self.tokens[c] for c in text]
    return encoded_text

  def decode(self,n):
    #Decode integer n to corresponding character
    for c, i in self.tokens.items():
      if i == n:
        return c
    return None
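
As a quick sanity check of the interface described above, the sketch below builds a tokenizer on a toy string and confirms that decoding an encoded string returns the original. The class name `SketchTokenizer`, the sorted character order, and the `chars` reverse-lookup list are assumptions made for illustration; they are not required by the assignment.

```python
class SketchTokenizer:
    """Hypothetical variant: same interface, plus a reverse-lookup list."""

    def __init__(self, text):
        # Sorting makes the character-to-integer mapping reproducible across runs
        self.chars = sorted(set(text))
        self.tokens = {c: i for i, c in enumerate(self.chars)}

    def encode(self, text):
        # Map every character to its integer id
        return [self.tokens[c] for c in text]

    def decode(self, n):
        # Direct list lookup instead of scanning the dictionary
        return self.chars[n]


toy = SketchTokenizer("alice was")
ids = toy.encode("alice")
round_trip = "".join(toy.decode(i) for i in ids)
print(ids, round_trip)  # [1, 5, 4, 2, 3] alice
```

Iterating over `set(text)` directly, as in the cell above, also works, but the integer assigned to each character may then differ from one Python session to the next.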






Now, create an object called `tok` of your `Tokenizer` class, and use it to encode `text` as a list of integers, `text_indices`.

In [None]:
tok=Tokenizer(text)
text_indices=tok.encode(text)

For convenience, we'll define `n` to be the length of your tokenizer dictionary:

In [None]:
n=len(tok.tokens)

The next task is to create feature sequences and targets. From `text_indices`, create a list-of-lists `X`. Each sublist of `X` should correspond to 50 consecutive elements of `text_indices`. At the same time, create a list `y` which contains the indices of the characters that follow each sublist of `X`. For example, `X[0]` should be a list containing the first 50 elements of `text_indices`: `text_indices[0]` through `text_indices[49]`. `y[0]` should be the 51st element, `text_indices[50]`. Something very similar was done in Homework 17.

To keep the size of the feature and target vectors manageable, consecutive lists in `X` should be shifted by 3, so the overlap is 47 elements. Hence, `X[1]` should be a list containing the integers `text_indices[3]` through `text_indices[52]`, and `y[1]` should be the integer `text_indices[53]`.

In [None]:
X=[]
y=[]
for i in range(0,len(text_indices)-50-1,3):
  X.append(text_indices[i:i+50])
  y.append(text_indices[i+50])
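
To see the windowing pattern at a smaller scale, here is a minimal sketch with a window of 5 and a shift of 3 over a toy list. The helper `make_windows` and the toy data are hypothetical. Its loop bound is `len(indices) - window`, which differs by one from the bound used in the cell above and may yield one extra window for some input lengths.

```python
def make_windows(indices, window, stride):
    """Return (X, y): windows of length `window` and the element following each."""
    X, y = [], []
    # Stop early enough that indices[i + window] always exists
    for i in range(0, len(indices) - window, stride):
        X.append(indices[i:i + window])
        y.append(indices[i + window])
    return X, y


toy_indices = list(range(12))  # stand-in for text_indices
X_toy, y_toy = make_windows(toy_indices, window=5, stride=3)
for features, target in zip(X_toy, y_toy):
    print(features, "->", target)
# [0, 1, 2, 3, 4] -> 5
# [3, 4, 5, 6, 7] -> 8
# [6, 7, 8, 9, 10] -> 11
```

Consecutive windows share `window - stride` elements: 2 in this toy case, 47 in the assignment.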

Convert `X` and `y` to numpy arrays with the same names, and check their shapes. If done correctly, the shape of `X` should be (45539, 50) and the shape of `y` should be (45539, ):

In [None]:
X=np.array(X)
y=np.array(y)
X.shape, y.shape

((45539, 50), (45539,))

Use the `to_categorical` function again to convert both `X` and `y` to one-hot encoded vectors of 0's and 1's, and check their shapes again. You should now have shapes (45539, 50, 29) and (45539, 29). In other words, the array `X` now contains 45,539 sequences of length 50, and each element of each sequence is a 29-dimensional vector of 28 zeros and a single one in the entry corresponding to some character in the text.

In [None]:
X = to_categorical(X, num_classes=n)
y = to_categorical(y, num_classes=n)
X.shape, y.shape

((45539, 50, 29), (45539, 29))
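
The shape change can be illustrated without Keras. The following pure-Python sketch one-hot encodes two toy sequences; the helper `one_hot` and the toy sizes are assumptions for illustration, while the notebook itself uses `to_categorical`, which returns NumPy arrays.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


num_classes = 4                     # toy stand-in for n = 29
sequences = [[0, 2, 3], [1, 1, 0]]  # 2 sequences of length 3
encoded = [[one_hot(i, num_classes) for i in seq] for seq in sequences]

shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)          # (2, 3, 4)
print(encoded[0][1])  # [0.0, 0.0, 1.0, 0.0]
```

Each integer becomes a vector of length `num_classes`, so a (2, 3) input turns into a (2, 3, 4) output, just as (45539, 50) becomes (45539, 50, 29).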

You're now ready to create your model. Create a neural network called `model`. This should have an input layer, a recurrent layer with 128 neurons, a dense layer, and a softmax layer. For your recurrent layer, you can use `SimpleRNN`, or something more sophisticated like an `LSTM`. (You'll get better results with an LSTM, but it will take MUCH longer. You can mitigate this by reducing the length of each sequence in `X` down to 10.) The number of neurons in your dense layer should be appropriate to predict the categorical variable `y`.

In [None]:
model = Sequential()
model.add(InputLayer(input_shape=(50, len(tok.tokens))))
model.add(SimpleRNN(128))
model.add(Dense(len(tok.tokens)))
model.add(Softmax())

In [None]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_3 (SimpleRNN)    (None, 128)               20224     
                                                                 
 dense_3 (Dense)             (None, 29)                3741      
                                                                 
 softmax_3 (Softmax)         (None, 29)                0         
                                                                 
Total params: 23,965
Trainable params: 23,965
Non-trainable params: 0
_________________________________________________________________
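
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard layout of a simple recurrent layer (input weights, recurrent weights, and one bias per unit) and of a dense layer (weights plus biases); the helper names are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    # input-to-hidden weights + hidden-to-hidden weights + one bias per unit
    return units * input_dim + units * units + units


def dense_params(input_dim, units):
    # one weight per (input, output) pair + one bias per output
    return input_dim * units + units


n_chars = 29
rnn = simple_rnn_params(n_chars, 128)
dense = dense_params(128, n_chars)
print(rnn, dense, rnn + dense)  # 20224 3741 23965
```

The softmax layer only normalizes its inputs, so it contributes no parameters.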


Compile your model using the `Adam` optimizer and an appropriately chosen loss function.

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
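
For intuition about what this loss measures, here is a small pure-Python sketch of categorical cross-entropy for a single example. It is an illustration of the formula, not Keras's implementation, and the toy probabilities are made up.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy for one example: -sum(t * log(p)) over the classes."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted_probs)
        if t > 0  # only the true class contributes for a one-hot target
    )


target = [0.0, 1.0, 0.0]
loss_confident = categorical_crossentropy(target, [0.05, 0.90, 0.05])
loss_unsure = categorical_crossentropy(target, [1 / 3, 1 / 3, 1 / 3])
print(round(loss_confident, 3), round(loss_unsure, 3))  # 0.105 1.099

# Baseline: guessing uniformly over 29 characters gives a loss of ln(29)
print(round(math.log(29), 2))  # 3.37
```

A training loss well below the uniform baseline of about 3.37 suggests the model has learned something about which characters tend to follow which.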

Fit your model to `X` and `y`. Train for 50 epochs with a batch size of 128. Each epoch will take about 95 seconds, so the full run takes roughly 80 minutes; plan to leave your computer running while it completes.

In [None]:
model.fit(X, y, epochs=50, batch_size=128)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7f610df96cd0>
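
The cost of this training run can be estimated with a few lines of arithmetic. The sketch below takes the 95-seconds-per-epoch figure quoted above at face value; actual timing depends on your hardware.

```python
import math

num_sequences = 45539
batch_size = 128
epochs = 50
seconds_per_epoch = 95  # figure quoted in the text; an assumption for your machine

# The last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(num_sequences / batch_size)
total_minutes = epochs * seconds_per_epoch / 60

print(steps_per_epoch)          # 356
print(round(total_minutes, 1))  # 79.2
```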

We will now use your trained model to generate text, one character at a time. Run the following code block to do this. (It will take a minute or two to complete.) It's interesting that although the model generates one character at a time, you'll see very word-like strings in the final text.

In [None]:
seq=[np.random.randint(0,len(tok.tokens)) for i in range(50)] #50 random integers for initial prediction
seq=to_categorical(np.array(seq),num_classes=len(tok.tokens)) #one-hot encode initial sequence

newtext=''
for i in range(100):
  pred_probs=model.predict(seq.reshape(1,50,len(tok.tokens))) #Use model to generate probs for next char
  index_pred=np.random.choice(n,1,p=pred_probs.reshape(n))[0] #choose one
  newtext+=tok.decode(index_pred) #corresponding character
  seq=np.vstack([seq,to_categorical(index_pred,num_classes=len(tok.tokens))]) #add element to end of sequence
  seq=seq[1:] #remove 1st element from sequence so we have another sequence of length 50

newtext #display generated text



' yem that if it make you wont gotagetreatlersew and and weve  jo tail  in then and soas was upon you'

**COPY AND PASTE THIS TEXT INTO THE SUBMISSION WINDOW ON GRADESCOPE**
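
For reference, the generation loop above follows a sample-then-slide pattern: predict probabilities for the next character, sample one index from them, append it to the window, and drop the oldest element. The pure-Python sketch below shows that pattern with a made-up stand-in for the trained network; `fake_model`, the four-letter vocabulary, and the window length of 3 are all assumptions for illustration.

```python
import random


def fake_model(window, vocab_size):
    """Stand-in for model.predict: favours the id after the last one in the window."""
    probs = [0.1 / (vocab_size - 1)] * vocab_size
    probs[(window[-1] + 1) % vocab_size] = 0.9
    return probs


random.seed(0)
vocab = "abcd"
window = [0, 1, 2]  # current context, as integer ids
generated = ""
for _ in range(8):
    probs = fake_model(window, len(vocab))
    # Sample the next id in proportion to the predicted probabilities
    next_id = random.choices(range(len(vocab)), weights=probs, k=1)[0]
    generated += vocab[next_id]
    # Slide the window: drop the oldest id, append the new one
    window = window[1:] + [next_id]

print(generated)
```

Sampling from the distribution, rather than always taking the most likely character, is what keeps the generated text from falling into a short repeating loop.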