# Task 2: Char-RNN

Char-RNN implements multi-layer recurrent neural networks (vanilla RNN, LSTM, and GRU) for training and sampling from character-level language models. In other words, the model takes a text file as input and trains a recurrent neural network to predict the next character in a sequence. The trained network can then generate text, character by character, that resembles the original training data. The approach was first published by Andrej Karpathy; his original code, written in *Lua*, is available at https://github.com/karpathy/char-rnn.

Here we will implement Char-RNN using TensorFlow!

In [1]:
import time
import numpy as np
import tensorflow as tf

# Notebook auto reloads code. (Ref: http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython)
%load_ext autoreload
%autoreload 2

## Part 1: Setup
In this part, we read the input text and preprocess it for network training. There are two txt files in the data folder; to keep computing time manageable, we use tinyshakespeare.txt here.

In [2]:
with open('tinyshakespeare.txt', 'r') as f:
    text = f.read()
# The length of the text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))
# Take a quick look at the first 500 characters
print(text[:500])

Length of text: 1115394 characters
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


In [3]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

65 unique characters


In [4]:
# Creating a mapping from unique characters to indices
vocab_to_ind = {c: i for i, c in enumerate(vocab)}
ind_to_vocab = dict(enumerate(vocab))
text_as_int = np.array([vocab_to_ind[c] for c in text], dtype=np.int32)

# Characters are mapped to indices 0 .. len(vocab) - 1; show the first 20
for char in vocab[:20]:
    print('{:6s} ---> {:4d}'.format(repr(char), vocab_to_ind[char]))
# Show how the first 10 characters from the text are mapped to integers
print('{} --- characters mapped to int --- > {}'.format(text[:10], text_as_int[:10]))

'\n'   --->    0
' '    --->    1
'!'    --->    2
'$'    --->    3
'&'    --->    4
"'"    --->    5
','    --->    6
'-'    --->    7
'.'    --->    8
'3'    --->    9
':'    --->   10
';'    --->   11
'?'    --->   12
'A'    --->   13
'B'    --->   14
'C'    --->   15
'D'    --->   16
'E'    --->   17
'F'    --->   18
'G'    --->   19
First Citi --- characters mapped to int --- > [18 47 56 57 58  1 15 47 58 47]
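As a quick sanity check of the two mappings, the sketch below encodes a string to integers and decodes it back. The helper names `encode` and `decode` and the short toy text are assumptions made for illustration; they are not part of the assignment code.

```python
# Toy corpus standing in for tinyshakespeare.txt (assumption for illustration).
text = "First Citizen:\nBefore we proceed"

vocab = sorted(set(text))
vocab_to_ind = {c: i for i, c in enumerate(vocab)}
ind_to_vocab = dict(enumerate(vocab))


def encode(s):
    """Map each character of s to its integer index (hypothetical helper)."""
    return [vocab_to_ind[c] for c in s]


def decode(indices):
    """Map a sequence of integer indices back to a string (hypothetical helper)."""
    return ''.join(ind_to_vocab[i] for i in indices)


encoded = encode("First")
decoded = decode(encoded)
print(encoded, '->', repr(decoded))
```

Because both dictionaries are built from the same sorted vocabulary, decoding an encoded string should always return the original string.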


## Part 2: Creating batches
Now that we have preprocessed the input data, we need to partition it. We train the model on mini-batches, so we first need to define what a batch is.

Let's first clarify two concepts:
1. **batch_size**: Recall batches in a CNN: with 100 samples and a batch_size of 10, we send 10 samples to the network at a time. In an RNN, batch_size has the same meaning: it defines how many samples (here, sequences) we send to the network at a time.
2. **sequence_length**: An RNN stores memory in its cells and passes information from one time step to the next, so we also need a sequence_length (also called 'steps'), which defines how long each sequence is.

Putting the two concepts together: let N be the number of sequences in a batch and M the length of each sequence. batch_size in an RNN **still** denotes the number of sequences in a batch (N), but each batch of data is actually an array of shape **[N, M]**.
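To make the **[N, M]** shape concrete, here is a minimal sketch on toy data. The array `data` and the small values of N and M are assumptions chosen so the result is easy to read; they are not the settings used for training.

```python
import numpy as np

N, M = 2, 5                    # sequences per batch, steps per sequence
data = np.arange(23)           # toy stand-in for text_as_int

chars_per_batch = N * M                      # characters consumed by one batch
n_batches = len(data) // chars_per_batch     # leftover characters are dropped

# One row per sequence: each row is a contiguous stretch of the data.
data = data[:n_batches * chars_per_batch].reshape((N, -1))

# The first batch is the first M columns of every row.
first_batch = data[:, :M]
print(data.shape, first_batch.shape)
print(first_batch)
```

Each row of a batch continues where the same row of the previous batch stopped, which is what allows an RNN to carry its state from one batch to the next.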

<span style="color:red">TODO:</span>
finish the get_batches() function below to generate mini-batches.

Hint: this function defines a generator, use *yield*.

In [5]:
def get_batches(array, n_seqs, n_steps):
    '''
    Partition data array into mini-batches
    input:
    array: input data
    n_seqs: number of sequences in a batch
    n_steps: length of each sequence
    output:
    x: inputs
    y: targets, which is x with one position shift
       you can check the following figure to get a sense of what a target looks like
    '''
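    # Number of characters consumed by one batch (n_seqs sequences of n_steps each)
    chars_per_batch = n_seqs * n_steps
    n_batches = len(array) // chars_per_batch
    # Keep only the full batches and drop the leftover characters.
    array = array[:chars_per_batch * n_batches]
    # One row per sequence. Copy so that the in-place shuffle below
    # does not scramble the caller's array.
    array = array.reshape((n_seqs, -1)).copy()

    # You should now create a loop to generate batches for inputs and targets
    #############################################
    #           TODO: YOUR CODE HERE            #
    #############################################
    # Loop forever so the generator never runs out during training.
    while True:
        # Shuffle the order of the rows at the start of every pass.
        np.random.shuffle(array)
        for n in range(0, array.shape[1], n_steps):
            x = array[:, n:n + n_steps]
            # Targets are the inputs shifted left by one position; the last
            # target of each row wraps around to that row's first input.
            y = np.zeros_like(x)
            y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
            yield x, y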
    

In [6]:
batches = get_batches(text_as_int, 10, 10)
x, y = next(batches)
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[50 58 57  1 51 39 63  1 57 46]
 [18 47 56 57 58  1 15 47 58 47]
 [ 1 43 52 43 51 63 11  0 37 43]
 [52 58 43 42  1 60 47 56 58 59]
 [46 47 51  1 42 53 61 52  1 58]
 [ 1 40 43 43 52  1 57 47 52 41]
 [57 47 53 52  1 53 44  1 56 43]
 [47 52  1 57 54 47 58 43  1 53]
 [56 44 53 50 49  6  0 27 52  1]
 [56 57  6  1 39 52 42  1 57 58]]

y
 [[58 57  1 51 39 63  1 57 46 50]
 [47 56 57 58  1 15 47 58 47 18]
 [43 52 43 51 63 11  0 37 43  1]
 [58 43 42  1 60 47 56 58 59 52]
 [47 51  1 42 53 61 52  1 58 46]
 [40 43 43 52  1 57 47 52 41  1]
 [47 53 52  1 53 44  1 56 43 57]
 [52  1 57 54 47 58 43  1 53 47]
 [44 53 50 49  6  0 27 52  1 56]
 [57  6  1 39 52 42  1 57 58 56]]
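In the output above, the last target of each row is that row's first input (`x[:, 0]`), not the character that actually follows in the text. That simplification affects only one position per row. As an alternative, the sketch below takes every target from the true next character. The function name `get_batches_next_char` is hypothetical, and the sketch makes a single pass without shuffling, so it is not a drop-in replacement for get_batches().

```python
import numpy as np


def get_batches_next_char(array, n_seqs, n_steps):
    """Hypothetical variant: one pass over the data, targets are true next characters."""
    chars_per_batch = n_seqs * n_steps
    # Reserve one extra character so that the very last target exists.
    n_batches = (len(array) - 1) // chars_per_batch
    n_keep = n_batches * chars_per_batch
    x_all = array[:n_keep].reshape((n_seqs, -1))
    # Same layout as x_all, but read one character later in the data.
    y_all = array[1:n_keep + 1].reshape((n_seqs, -1))
    for n in range(0, x_all.shape[1], n_steps):
        yield x_all[:, n:n + n_steps], y_all[:, n:n + n_steps]


toy = np.arange(41)            # toy data: each value is followed by value + 1
all_batches = list(get_batches_next_char(toy, n_seqs=2, n_steps=5))
x, y = all_batches[0]
print('x\n', x)
print('y\n', y)
```

With the toy data every target equals its input plus one, which makes it easy to confirm that no position wraps around.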


## Part 3: Build Char-RNN model
In this section, we build the Char-RNN model. It consists of an input layer, an rnn_cell layer, an output layer, a loss, and an optimizer; we will build them one by one.

The goal is to generate new text that follows a given prime (seed) string, so the training data must be split into inputs and targets. The figure below shows the structure of the Char-RNN network.

![structure](img/charrnn.jpg)
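The CharRNN class itself is not shown in this notebook, so the following is only a rough NumPy sketch of how the pieces fit together in a forward pass: one-hot input, a recurrent cell, an output layer, and a cross-entropy loss. It assumes a single-layer vanilla RNN with arbitrary small random weights, and every name in it (`W_xh`, `W_hh`, `W_hy`, `forward_loss`) is hypothetical. It is not the TensorFlow implementation that the TODOs ask for, and it has no optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 65, 16      # 65 matches the vocabulary above; 16 is arbitrary
n_seqs, n_steps = 4, 10

# Hypothetical parameters of a single-layer vanilla RNN.
W_xh = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_hy = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
b_y = np.zeros(vocab_size)


def forward_loss(x, y):
    """Mean cross-entropy of next-character predictions for integer batches x and y."""
    rows = np.arange(x.shape[0])
    h = np.zeros((x.shape[0], hidden_size))            # initial hidden state
    total = 0.0
    for t in range(x.shape[1]):
        x_onehot = np.eye(vocab_size)[x[:, t]]          # input layer: one-hot encoding
        h = np.tanh(x_onehot @ W_xh + h @ W_hh + b_h)   # recurrent cell
        logits = h @ W_hy + b_y                         # output layer
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -log_probs[rows, y[:, t]].mean()       # loss at this time step
    return total / x.shape[1]


x = rng.integers(0, vocab_size, size=(n_seqs, n_steps))
y = np.roll(x, -1, axis=1)            # targets: inputs shifted left by one position
loss = forward_loss(x, y)
print(loss, np.log(vocab_size))
```

An untrained model with small weights predicts a nearly uniform distribution, so its loss should sit close to ln(65) ≈ 4.17. That gives a reference point for reading the loss values printed during training.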

<span style="color:red">TODO:</span>
finish all TODOs in ecbm4040.CharRNN and the blanks in the following cells.

**Note: With the parameter settings below, training takes about 20 minutes on a GTX 1070 GPU, so we suggest using GCP for this task.**

In [8]:
from CharRNN import *

### Training
With sampling set to False (the default), we can start training the network. Checkpoints are saved automatically in the checkpoints folder.

In [9]:
# These are preset parameters; you can change them to get better results
batch_size = 100         # Sequences per batch
num_steps = 100          # Number of sequence steps per batch
rnn_size = 256           # Size of hidden layers in rnn_cell
num_layers = 2           # Number of hidden layers
learning_rate = 0.005    # Learning rate
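To get a feel for what these settings imply, the short calculation below estimates how many passes over the text the training run makes and how long it takes. It assumes that the value 6000 passed to model.train is the number of training steps and uses a rough per-batch time read off the LSTM training log; both are assumptions about code that is not shown here.

```python
text_len = 1115394            # length of the text reported in Part 1
batch_size, num_steps = 100, 100
max_steps = 6000              # assumed meaning of the first number passed to model.train
sec_per_batch = 0.207         # rough average taken from the LSTM training log

chars_per_step = batch_size * num_steps          # characters seen per training step
steps_per_epoch = text_len // chars_per_step     # full batches in one pass over the text
epochs = max_steps / steps_per_epoch             # passes over the text
minutes = max_steps * sec_per_batch / 60         # estimated wall-clock time

print(chars_per_step, steps_per_epoch, round(epochs, 1), round(minutes, 1))
```

Under these assumptions the run covers the text about 54 times and takes a little over 20 minutes, which is consistent with the training-time note above.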

In [91]:
model = CharRNN(len(vocab), batch_size, num_steps, 'LSTM', rnn_size,
               num_layers, learning_rate)
batches = get_batches(text_as_int, batch_size, num_steps)
model.train(batches, 6000, 2000)

step: 200  loss: 2.1357  0.2098 sec/batch
step: 400  loss: 1.8421  0.2067 sec/batch
step: 600  loss: 1.6938  0.2090 sec/batch
step: 800  loss: 1.6324  0.2070 sec/batch
step: 1000  loss: 1.6613  0.2091 sec/batch
step: 1200  loss: 1.5557  0.2088 sec/batch
step: 1400  loss: 1.5540  0.2076 sec/batch
step: 1600  loss: 1.5051  0.2045 sec/batch
step: 1800  loss: 1.4805  0.2047 sec/batch
step: 2000  loss: 1.4816  0.2079 sec/batch
step: 2200  loss: 1.4766  0.2048 sec/batch
step: 2400  loss: 1.4170  0.2060 sec/batch
step: 2600  loss: 1.4420  0.2111 sec/batch
step: 2800  loss: 1.4236  0.2084 sec/batch
step: 3000  loss: 1.4270  0.2036 sec/batch
step: 3200  loss: 1.4300  0.2069 sec/batch
step: 3400  loss: 1.4077  0.2095 sec/batch
step: 3600  loss: 1.4409  0.2076 sec/batch
step: 3800  loss: 1.3666  0.2087 sec/batch
step: 4000  loss: 1.4060  0.1993 sec/batch
step: 4200  loss: 1.3896  0.2030 sec/batch
step: 4400  loss: 1.3794  0.2090 sec/batch
step: 4600  loss: 1.3901  0.2036 sec/batch
step: 4800  los

In [92]:
# look up checkpoints
tf.train.get_checkpoint_state('checkpoints')

model_checkpoint_path: "checkpoints/i6000_l256.ckpt"
all_model_checkpoint_paths: "checkpoints/i2000_l256.ckpt"
all_model_checkpoint_paths: "checkpoints/i4000_l256.ckpt"
all_model_checkpoint_paths: "checkpoints/i6000_l256.ckpt"

### Sampling
Set sampling to True and we can generate new characters one by one. Using the saved checkpoints, we can see how the network's output improved over the course of training.
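How model.sample chooses each character is not shown in this notebook. One common approach, sketched below as an assumption, is to keep only the few most likely characters from the predicted distribution, renormalize, and draw one of them at random. The function name `pick_top_n` and the `top_n` parameter are hypothetical.

```python
import numpy as np


def pick_top_n(probs, vocab_size, top_n=5, rng=None):
    """Sample a character index from the top_n most likely entries of probs."""
    if rng is None:
        rng = np.random.default_rng()
    # Work on a flat copy so the caller's array is left untouched.
    p = np.array(probs, dtype=np.float64).reshape(-1)
    # Zero out everything except the top_n largest probabilities.
    p[np.argsort(p)[:-top_n]] = 0.0
    # Renormalize so the remaining probabilities sum to 1.
    p /= p.sum()
    return int(rng.choice(vocab_size, p=p))


# Toy distribution over a 5-character vocabulary (illustration only).
probs = np.array([0.05, 0.4, 0.3, 0.2, 0.05])
rng = np.random.default_rng(0)
draws = [pick_top_n(probs, len(probs), top_n=2, rng=rng) for _ in range(100)]
print(sorted(set(draws)))
```

Restricting the draw to the most likely characters trades some variety for fewer implausible characters in the generated text.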

In [93]:
model = CharRNN(len(vocab), batch_size, num_steps,'LSTM', rnn_size,
               num_layers, learning_rate, sampling=True)
# choose the last checkpoint and generate new text
checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = model.sample(checkpoint, 1000, len(vocab), vocab_to_ind, ind_to_vocab, prime='LORD ')
print(samp)

INFO:tensorflow:Restoring parameters from checkpoints/i6000_l256.ckpt
['L' 'O' 'R' ... 't' 'h' 'e']


In [94]:
for i in samp:
    print(i,end='')

LORD these sealst
And towned at your power, and as you dear,
To be by honour of thyself were so.

KATHARINA:
I'll save, sir! to hear her and borts
Than's mouged throops: who thoughts he will not stand,
This is the place and though the seat of sigh storm
Talked for sordlus things, thou and the sunder
Of honour there to bear me at him to-norder.

GLOUCESTER:
Wyaved soul shall spate the stear'n trurceit of thee.
The lorder to tantuadigrument,
Be sorry to my sperits, to be so.
I say then the words, something strange and tender.

GLOUCESTER:
What so then I am trances on them as her friar?
This sirk am that within, I whould river on,
As thou shouldst strew subject that a them on to him.

CLIFFORD:
Ay, this's too, brief him what thou hast boughtly,r' words.

CAMILLO:
To be to mean to say her chambicger men.

Provost:
What, to the tears that I have houes in that
I have to seal a corit of his bowt seems sovereign,
I must be so too, till to bade these love.

GLOUCESTER:
Thy marky stredce there a

In [97]:
# choose a checkpoint other than the final one and see the results. It could be nasty, don't worry!
#############################################
#           TODO: YOUR CODE HERE            #
#############################################
model = CharRNN(len(vocab), batch_size, num_steps,'LSTM', rnn_size,
               num_layers, learning_rate, sampling=True)
checkpoint = "checkpoints/i2000_l256.ckpt"
samp = model.sample(checkpoint, 1000, len(vocab), vocab_to_ind, ind_to_vocab, prime="LORD ")
print(samp)

INFO:tensorflow:Restoring parameters from checkpoints/i2000_l256.ckpt
['L' 'O' 'R' ... ' ' 't' 'h']


In [98]:
for i in samp:
    print(i,end='')

LORD and then,
The holdird of your silk off and hat to me.

KONH HENRY 'T 
We will she's save a person to much
Worliep of a stand.

COMINIUS:
Ay, beater you a wife to save them blded.

MISRRASLANC:
If you do no more an enome, and thy command shoods.

GLOUCESTER:
As it send them so barren of my hang,
The sin of this some son we have the part of
hen all so dare'd; therefore, that I, the hardes should have.--
The day it should be so houseness. I'll stain and heart out
For her the fount a capt of hall is now the
concersed warmarit to him, in the morning
Wing thee not shall be he a shepter, and burthen
thisers is thrings to thought thy here, sir, bry thee no
confestion: bear a mother's prayers of that?

CORINA:
He would not speak it, well to have some holy menes.

PETRUCHIO:
One as I am with more other sad too, he'se the
wood traison
And seem on those and strange, and stand to lia
To should so more thee here. 
MENENIUS:
Thy sine as art to the more merd, bedore.

POMPEY:
Went to my love, ind

### Switch to another type of RNN cell
We have been using LSTM cells, as in the original work, but GRU cells have become increasingly popular. Let's change the cell in the rnn_cell layer to a GRU cell and see how it performs. Use the same number of steps as above.

**Note: You need to change the names of your saved checkpoints, or the new ones will overwrite the LSTM results that you have already saved.**

In [99]:
# These are preset parameters; you can change them to get better results
batch_size = 100         # Sequences per batch
num_steps = 100          # Number of sequence steps per batch
rnn_size = 256           # Size of hidden layers in rnn_cell
num_layers = 2           # Number of hidden layers
learning_rate = 0.005    # Learning rate

model = CharRNN(len(vocab), batch_size, num_steps, 'GRU', rnn_size,
               num_layers, learning_rate)
batches = get_batches(text_as_int, batch_size, num_steps)
model.train(batches, 6000, 2000)

step: 200  loss: 2.3189  0.1972 sec/batch
step: 400  loss: 1.9622  0.1946 sec/batch
step: 600  loss: 1.7934  0.1942 sec/batch
step: 800  loss: 1.7167  0.1926 sec/batch
step: 1000  loss: 1.7364  0.1967 sec/batch
step: 1200  loss: 1.6092  0.1934 sec/batch
step: 1400  loss: 1.5947  0.1899 sec/batch
step: 1600  loss: 1.5388  0.1984 sec/batch
step: 1800  loss: 1.5021  0.1947 sec/batch
step: 2000  loss: 1.5135  0.1939 sec/batch
step: 2200  loss: 1.4950  0.2016 sec/batch
step: 2400  loss: 1.4478  0.1950 sec/batch
step: 2600  loss: 1.4506  0.1888 sec/batch
step: 2800  loss: 1.4231  0.1963 sec/batch
step: 3000  loss: 1.4404  0.1977 sec/batch
step: 3200  loss: 1.4371  0.1953 sec/batch
step: 3400  loss: 1.4168  0.2040 sec/batch
step: 3600  loss: 1.4430  0.1972 sec/batch
step: 3800  loss: 1.3672  0.1949 sec/batch
step: 4000  loss: 1.4248  0.1948 sec/batch
step: 4200  loss: 1.3980  0.1957 sec/batch
step: 4400  loss: 1.3859  0.1975 sec/batch
step: 4600  loss: 1.3882  0.1932 sec/batch
step: 4800  los

In [104]:
model = CharRNN(len(vocab), batch_size, num_steps, 'GRU', rnn_size,
               num_layers, learning_rate, sampling=True)
# choose the last checkpoint and generate new text
checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = model.sample(checkpoint, 1000, len(vocab), vocab_to_ind, ind_to_vocab, prime="LORD ")
print(samp)

INFO:tensorflow:Restoring parameters from checkpoints/i6000_l256.ckpt
['L' 'O' 'R' ... 'h' 'o' 'u']


In [105]:
for i in samp:
    print(i,end='')

LORD and tears
Or that thou layst and thought, and tell into my lage,
It may be bring to some mind and to my lark unto
Our truth of his contemth buckle farment too;
Teans and the strengmed and mercy in hist title and
What should I be my life of him served thing to me arrink,
I have men in the mighty sovereign low.

HORTENSIO:
The care of the statorury of the bought
Of daughal far a brother, with the fall such any here!

DUKE OF AUMENOESBELLE: I have I make her:
To her and fortune's hand is so bear too say.
I say ' of trinustuly my honour is!
O than men! I would shame me fair, but served
In thy mutis a terdal. Thy miful through torture
Being sending be corn friends. But who darly side?

BAPOIABLALdENCE:
Went a pardon a cale, sir, and is not the leisure.

HORTENSIO:
To how there will straint, my too mass light, they,
And to my house, my liege, thy loves, taught me,
Is son a wook, to make o desert from him,
To make me speak wilm out of my life;
And art I will not, for my father should;
An

#### Questions
1. Compare the results of the two networks you built and explain what caused the differences. (This is a qualitative comparison; base it on the specific models that you built.)
2. Discuss the differences between LSTM cells and GRU cells. What are the pros and cons of using GRU cells?

Answer:

**1. Comparison:**
In this run the GRU was slightly faster than the LSTM: the LSTM log shows roughly 0.20 to 0.21 sec/batch, while the GRU log shows roughly 0.19 to 0.20 sec/batch. The training losses are close (for example, 1.3901 for the LSTM versus 1.3882 for the GRU at step 4600), but the LSTM's samples read better: the words it produces make more sense than the GRU's.

**2. The difference:**
A GRU has 2 gates: update gate z and reset gate r. An LSTM has 3: input gate i, forget gate f, and output gate o.

In the LSTM unit, the amount of memory content that is seen or used by other units in the network is controlled by the output gate. The GRU, on the other hand, exposes its full content without any control.

Another difference is the location of the input gate, or the corresponding reset gate. The LSTM unit computes the new memory content without any separate control over the amount of information flowing from the previous time step. Rather, the LSTM unit controls the amount of new memory content added to the memory cell independently of the forget gate. The GRU, on the other hand, controls the information flow from the previous activation when computing the new candidate activation, but does not independently control the amount of the candidate activation being added.

**Pros of GRU:** a GRU has fewer parameters and thus may train a bit faster or need less data to generalize.

**Cons of GRU:** with large amounts of data, the more expressive LSTM may lead to better results.
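The parameter-count argument can be made concrete with a small calculation. The sketch below assumes the standard formulations, in which an LSTM cell has four weight blocks (three gates plus the candidate) and a GRU cell has three (two gates plus the candidate), each of shape (input_size + hidden_size, hidden_size) with one bias vector. It also assumes one-hot character inputs of size 65 for the first layer. The actual CharRNN implementation may differ, and the function names are hypothetical.

```python
def gated_cell_params(n_blocks, input_size, hidden_size):
    """Weights plus biases for a cell made of n_blocks gate/candidate blocks."""
    return n_blocks * ((input_size + hidden_size) * hidden_size + hidden_size)


vocab_size, rnn_size, num_layers = 65, 256, 2


def stack_params(n_blocks):
    """Total recurrent parameters for num_layers stacked cells."""
    total = 0
    input_size = vocab_size           # assumes one-hot character inputs
    for _ in range(num_layers):
        total += gated_cell_params(n_blocks, input_size, rnn_size)
        input_size = rnn_size         # deeper layers read the output of the layer below
    return total


lstm_params = stack_params(4)   # input, forget, and output gates plus candidate
gru_params = stack_params(3)    # update and reset gates plus candidate
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions the GRU stack has three quarters of the LSTM stack's recurrent parameters, which fits the slightly lower per-batch time seen in the logs.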