### Making an RNN Learn to Generate English Text ###
- An RNN is trained in a sequence-to-sequence (seq2seq) manner so that it learns to generate text.
- Fed with a large amount of text, the network learns a model of the language.
- The text corpus is split into chunks of fixed length.
- Each character is represented by an integer index.
- The network learns to model the conditional probability of the next character given the previous N characters.
- This code unrolls the RNN explicitly with a for loop, to demonstrate how the hidden state (the output of the hidden layer) is carried forward to the next time step.
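
As a rough illustration of the indexing and conditional-probability points, the sketch below indexes the characters of a toy string and estimates next-character probabilities by simple counting with N = 1. The toy string and the names `toy_alphabet` and `next_char_probs` are assumptions made for this example; the notebook itself learns these probabilities with a recurrent network rather than by counting.

```python
from collections import Counter, defaultdict

toy_text = "the theme of the thesis"   # assumed toy corpus, not the notebook's data

# Each unique character gets an integer index, similar to the notebook's `alphabet`.
toy_alphabet = sorted(set(toy_text))
char_to_index = {c: i for i, c in enumerate(toy_alphabet)}
encoded = [char_to_index[c] for c in toy_text]

def next_char_probs(text):
    """Estimate P(next char | previous char) by counting adjacent pairs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {c: n / float(sum(ctr.values())) for c, n in ctr.items()}
        for prev, ctr in counts.items()
    }

probs = next_char_probs(toy_text)
print(toy_alphabet)
print(encoded[:8])
print(probs["e"])   # distribution over the characters that follow 'e'
```

A counting model like this only ever sees one previous character; the recurrent network can, in principle, condition on a longer history through its hidden state.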


<b>Acknowledgement :</b>- This code is almost completely copied from here https://gist.github.com/michaelklachko?direction=desc&sort=updated . 

In [1]:
import string
import random
import torch
import torch.nn as nn
from torch.autograd import Variable
import time, math
 
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

use_cuda = torch.cuda.is_available()

if use_cuda:
    print ('CUDA is available')
#use_cuda = False   # uncomment this if you don't want to use CUDA variables

CUDA is available


In [2]:
printable = string.printable
 
#Input text is available here: https://sherlock-holm.es/stories/plain-text/cano.txt
text = open('../../../data/lab2/sh.txt', 'r').read().lower()



## remove non-printable characters and some unnecessary punctuation
pruned_text = ''

for c in text:
    if c in printable and c not in '{}[]&_':
        pruned_text += c

text = pruned_text
file_len = len(text)
alphabet = sorted(list(set(text)))
n_chars = len(alphabet)

print "\nTraining RNN on The Complete Sherlock Holmes.\n"		 
print "\nFile length: {:d} characters\nUnique characters: {:d}".format(file_len, n_chars)
print "\nUnique characters:", alphabet		 
print ('no of uniq chars', n_chars)


Training RNN on The Complete Sherlock Holmes.


File length: 3867934 characters
Unique characters: 52

Unique characters: ['\n', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
('no of uniq chars', 52)


In [3]:
def time_since(since):
    s = time.time() - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)
 

In [4]:
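def random_chunk():
    # random.randint is inclusive at both ends, and a chunk needs chunk_len + 1
    # characters so that input and target (shifted by one) both have chunk_len characters
    start = random.randint(0, file_len - chunk_len - 1)
    end = start + chunk_len + 1
    return text[start:end]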

def chunk_vector(chunk):
    vector = torch.zeros(len(chunk)).long()
    for i, c in enumerate(chunk):
        vector[i] = alphabet.index(c)  #position of the character in `alphabet`, one number per character
    if use_cuda:
        vector = vector.cuda()
    return Variable(vector, requires_grad=False)


In [5]:
def random_training_batch():
    inputs = []
    targets = []
    #construct list of input vectors (chunk_len):
    for b in range(batch_size):    
        chunk = random_chunk()
        inp = chunk_vector(chunk[:-1])
        target = chunk_vector(chunk[1:])
        inputs.append(inp)
        targets.append(target)
    #construct batches from lists (chunk_len, batch_size):
    #need .view to handle batch_size=1
    #need .contiguous to allow .view later
    inp = torch.cat(inputs, 0).view(batch_size, chunk_len).t().contiguous()
    target = torch.cat(targets, 0).view(batch_size, chunk_len).t().contiguous()
    return inp, target
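
The reshaping at the end of `random_training_batch` can be hard to picture. The following sketch mimics it with plain Python lists instead of tensors; the toy sizes and the `transpose` helper are assumptions for illustration only. Each row of the batch is one chunk, and transposing gives one row per time step, which is what a loop over the batch then iterates over.

```python
batch_size, chunk_len = 3, 4          # toy sizes, not the notebook's values

# One list of character indices per chunk: shape (batch_size, chunk_len).
batch = [[10 * b + t for t in range(chunk_len)] for b in range(batch_size)]

def transpose(rows):
    """Turn (batch_size, chunk_len) into (chunk_len, batch_size)."""
    return [list(col) for col in zip(*rows)]

time_major = transpose(batch)

for t, step in enumerate(time_major):
    # `step` holds the t-th character index of every chunk in the batch.
    print(t, step)
```

With this layout, iterating over the first dimension yields, at each step, one character from every chunk in the batch.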

### Language modelling as sequence-to-sequence learning ###
![char-rnn seq2seq](charrnnembed.png)


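- The input and target sequences contain the same characters, shifted by one position.
- For example, if your corpus is "cvit summer school" and a chunk of 4 characters is taken from its start,
    - then the chunk is "cvit",
    - the input sequence is "cvi", and
    - the target sequence is "vit".
- In the code above, `random_chunk` returns `chunk_len + 1` characters, so the input and the target each have `chunk_len` characters.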
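
The shift between input and target can be reproduced in a couple of lines. This is a minimal sketch; `make_pair` is a hypothetical helper used only for illustration, not a function from the notebook.

```python
def make_pair(chunk):
    """Split a chunk into an input and a target shifted by one character."""
    return chunk[:-1], chunk[1:]

inp, target = make_pair("cvit")
for x, y in zip(inp, target):
    # at each time step the network sees x and should predict y
    print("input %r -> target %r" % (x, y))
```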
 
    

```haskell
Network.forward :: x(t), h(t-1) -> y(t), h(t)
```

To make the role of the hidden state easier to follow, and to allow experimenting with it, the module is built so that the hidden state is passed in and returned explicitly.

We're using a `GRU`; you can substitute an `RNN` or an `LSTM` with the required parameters. For an `LSTM`, you additionally have to carry the cell state through the forward pass.
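
To see what "carrying the hidden state forward" means without any framework, here is a minimal sketch of an explicitly unrolled recurrence. It uses a scalar tanh cell with arbitrary, assumed weights (`w_xh`, `w_hh`, `w_hy`), which is a much simpler stand-in for the `GRU`; only the calling pattern mirrors the signature given earlier.

```python
import math

def rnn_step(x, h, w_xh=0.5, w_hh=0.9, w_hy=2.0):
    """One time step of a scalar toy RNN: maps (x(t), h(t-1)) to (y(t), h(t))."""
    h_new = math.tanh(w_xh * x + w_hh * h)   # new hidden state from input and old state
    y = w_hy * h_new                         # output read off the hidden state
    return y, h_new

inputs = [1.0, 0.0, 0.0, 0.0]   # assumed toy input sequence
h = 0.0                         # initial hidden state
outputs = []
for x in inputs:
    # the hidden state returned at step t is passed back in at step t + 1
    y, h = rnn_step(x, h)
    outputs.append(y)

print(outputs)
```

Only the first input is non-zero, yet every later output is non-zero as well: the information about the first input reaches later steps purely through the hidden state.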

In [6]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers):
        super(RNN, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers 
        #batch_size is passed to forward() and init_hidden() explicitly, so it is not stored here
        
        self.encoder = nn.Embedding(input_size, hidden_size) #first arg is dictionary size
        self.GRU = nn.GRU(hidden_size, hidden_size, n_layers)  #(input_size, hidden_size, n_layers)
        self.decoder = nn.Linear(hidden_size, output_size)
        
    def forward(self, input, hidden, batch_size):
        #expand input vector length from single number to hidden_size vector
        #Input: LongTensor (batch, seq_len)
        #Output: (batch, seq_len, hidden_size)
        input = self.encoder(input.view(batch_size, seq_len)) 
        #need to reshape Input to (seq_len, batch, hidden_size)
        input = input.permute(1, 0, 2)
        #Hidden (num_layers * num_directions, batch, hidden_size), num_directions = 2 for BiRNN
        #Output (seq_len, batch, hidden_size * num_directions)
        output, hidden = self.GRU(input, hidden) 
        #output, hidden = self.GRU(input.view(seq_len, batch_size, hidden_size), hidden) 
        #Output becomes (batch, hidden_size * num_directions), seq_len=1 (single char)
        output = self.decoder(output.view(batch_size, self.hidden_size))  
        #now the output is (batch_size, output_size)
        return output, hidden
    
    def init_hidden(self, batch_size):
        #Hidden (num_layers * num_directions, batch, hidden_size), num_directions = 2 for BiRNN
        hidden = torch.randn(self.n_layers, batch_size, self.hidden_size)
        if use_cuda:
            hidden = hidden.cuda()
        return Variable(hidden)

        


In [10]:

seq_len = 1        #the network is fed one character (one time step) per forward call
chunk_len = 128    #number of characters in a single text sample
batch_size = 16   #number of text samples in a batch
n_batches = 200   #size of training dataset (total number of batches)
hidden_size = 256  #width of model
n_layers = 2      #depth of model
LR = 0.005         #learning rate

net = RNN(n_chars, hidden_size, n_chars, n_layers)
cost = nn.CrossEntropyLoss()
if use_cuda:
    net = net.cuda()
    cost = cost.cuda()
optim = torch.optim.Adam(net.parameters(), LR)

print "\nModel parameters:\n"
print "n_batches: {:d}\nbatch_size: {:d}\nchunk_len: {:d}\nhidden_size: {:d}\nn_layers: {:d}\nLR: {:.4f}\n".format(n_batches, batch_size, chunk_len, hidden_size, n_layers, LR)
print "\nRandom chunk of text:\n\n", random_chunk(), '\n'

# Training procedure:
# - take (input, target) pairs of chunks (the target is shifted forward by a single character)
# - convert them into chunk vectors
# - for each char pair (i, t) in the chunk vectors (input, target), create embeddings with dim = hidden_size
# - feed the input char vectors to the GRU model, and compute the cross-entropy loss between output and target
# - update weights after going through all chars in the chunk


Model parameters:

n_batches: 200
batch_size: 16
chunk_len: 128
hidden_size: 256
n_layers: 2
LR: 0.0050


Random chunk of text:

nce of a
     solitary swan. holmes gazed at it and then passed on to the lodge
     gate. there he scribbled a short note for st 


In [13]:
def evaluate(prime_str = 'a', predict_len = 100, temp = 0.8, batch_size = 1):
    hidden = net.init_hidden(batch_size) 
    prime_input = chunk_vector(prime_str)
    predicted = prime_str
    
    for i in range(len(prime_str)-1):
        _, hidden = net(prime_input[i], hidden, batch_size)
     
    inp = prime_input[-1]
    
    for i in range(predict_len):
        output, hidden = net(inp, hidden, batch_size)
        output_dist = output.data.view(-1).div(temp).exp()  
        top_i = torch.multinomial(output_dist, 1)[0]
        
        predicted_char = alphabet[top_i]
        predicted +=  predicted_char
        inp = chunk_vector(predicted_char)

    return predicted
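
The `temp` argument of `evaluate` divides the network's raw scores before they are exponentiated and sampled from. The sketch below reproduces that step in plain Python under stated assumptions: the three toy scores are made up, and the helpers `temperature_probs` and `sample_index` are hypothetical stand-ins for the `.div(temp).exp()` call and for `torch.multinomial`. Unlike the notebook code, it normalizes the weights explicitly so that the probabilities can be printed.

```python
import math
import random

def temperature_probs(scores, temp):
    """Turn raw scores into probabilities after dividing by a temperature."""
    scaled = [s / temp for s in scores]
    m = max(scaled)                          # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_index(probs, rng=random):
    """Draw one index with probability given by `probs`."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if cumulative > r:
            return i
    return len(probs) - 1                    # guard against rounding error

scores = [2.0, 1.0, 0.1]                     # assumed toy scores for three characters
for temp in (0.5, 0.8, 2.0):
    probs = temperature_probs(scores, temp)
    print(temp, [round(p, 3) for p in probs], sample_index(probs))
```

Comparing the printed distributions for different temperatures is a useful warm-up for the exercises at the end of the notebook.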



In [16]:

 
start = time.time()

training_set = []

for i in range(n_batches):
    training_set.append((random_training_batch()))

i = 0    
for inp, target in training_set:
    #re-init hidden outputs, zero grads, zero loss:
    hidden = net.init_hidden(batch_size)
    net.zero_grad()
    loss = 0        
    #for each char in a chunk:
    #compute output, error, loss:
    for c, t in zip(inp, target):
        output, hidden = net(c, hidden, batch_size)
        loss += cost(output, t)
    #calculate gradients, update weights:
    loss.backward()
    optim.step()

    if i % 100 == 0:
        print "\n\nSample output:\n"
        print evaluate('wh', 100, 0.8), '\n'
        print('[%s (%d / %d) loss: %.4f]' % (time_since(start), i, n_batches, loss.data[0] / chunk_len))

    i += 1      



Sample output:

what sleft with before
     having on me and you all-comment a more holme, and the room to cab strange 

[0m 1s (0 / 200) loss: 1.5013]


KeyboardInterrupt: 
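
The printed loss is the cross-entropy summed over the time steps of a chunk and divided by `chunk_len`, so it can be read as an average per-character loss in nats (`nn.CrossEntropyLoss` averages over the batch by default). One way to interpret such a number, sketched below, is to compare it with the loss of a uniform guess over the 52 characters and to convert it into a perplexity. The helper names `uniform_loss` and `perplexity` are assumptions for illustration.

```python
import math

n_chars = 52            # alphabet size reported earlier in the notebook
reported_loss = 1.5013  # per-character loss printed in the sample run above

def uniform_loss(n):
    """Cross-entropy (in nats) of guessing uniformly among n characters."""
    return math.log(n)

def perplexity(loss):
    """Effective number of equally likely choices implied by a loss in nats."""
    return math.exp(loss)

print("uniform baseline loss: %.4f" % uniform_loss(n_chars))
print("perplexity at reported loss: %.2f" % perplexity(reported_loss))
```

A loss well below the uniform baseline indicates that the network has picked up at least some of the statistics of the text.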

### Exercise 1 ###
1. Why do you have to take the hidden state from the network at each step and pass it back in along with the next input?
2. For how long is the hidden state carried forward during training?
    - A. It is carried forward from one time step to the next within a sequence, but not from the last time step of one sequence to the first time step of the next sequence.
    - B. It is carried forward not just across time steps within a sequence, but also from one sequence to the next.
    - C. It is carried forward throughout the whole of training.
3. For what value of the temperature T (the `temp` argument of `evaluate`) is the sampling equivalent to an argmax, i.e. picking the most probable character?
4. Vary the value of T and see how the generated text changes.


### Exercise 2 ###

1. In the code above, learning is modelled as a seq2seq problem: the input is a sequence of characters and the target is another sequence of characters, so there is a target at every time step of the sequence. Text generation can also be modelled as a sequence-to-one problem, where the input is a sequence and the target is just the next character after it. Can you modify the code to do this? (Remember that, since it is sequence-to-one, the output of the hidden layer needs to be fed to the output layer only at the last time step.)
2. Try using an MSE loss for the above problem. How does the network converge with an MSE loss? Why did MSE perform worse or better?
