### Making an RNN Learn to Generate English Text ###
- An RNN is trained in a seq2seq manner so that it learns to generate text.
- Fed a large amount of text, the network learns to model the language.
- The text corpus is split into chunks of fixed length.
- Each character is represented by its corresponding one-hot vector.
- The network learns to model the conditional probability of the next character, given the previous N characters.
- This code unrolls the RNN explicitly with a for loop, to demonstrate how the hidden state (the output of the hidden layer) is carried forward to the next time step.


## Notations

Throughout this walkthrough, a few notations are used for variable names in order to keep the code concise. Please get familiar with them before you start.

The function we're looking to learn is `g(x)`, and what we actually obtain after training is `f(x)`. `x` is the input to that function.

We have:

```python
sequences, predictions, targets = x, y, z = x, f(x), g(x)
hidden state = h
```

In [1]:
import torch 
from torch import nn, optim
from torch.autograd import Variable
from tqdm import tqdm
from random import randint, shuffle
import string


In [2]:
use_cuda = torch.cuda.is_available()
if use_cuda:
    print ('CUDA is available')
#use_cuda=False # uncomment this if you want to run on CPU alone

CUDA is available


## Embedding

We use one-hot vectors to represent each character. A one-hot vector is like a bitstring: for each character we obtain a vector with a single one, at the index of that character in the alphabet, and zeroes everywhere else.

One-hot vectors are a rather *sparse* representation, which blows the feature space up from a lower dimension (a single index) to the size of the alphabet. In place of a one-hot vector, you can also use an `Embedding` layer, which takes in an integer (the index of the char in the alphabet) and learns a *dense* representation of the input space during training. This may bring related characters closer together in the resulting feature space.
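
As a minimal sketch of the two ideas above, the snippet below builds a one-hot vector with plain Python lists and shows that multiplying it by a weight matrix simply selects one row of that matrix, which is what a dense lookup does. The names `one_hot`, `matvec`, `toy_alphabet` and `toy_weights` are hypothetical and used only for illustration; they are not part of the notebook's code.

```python
def one_hot(char, alphabet):
    """Return a list with a single 1.0 at the index of `char` in `alphabet`."""
    vec = [0.0] * len(alphabet)
    vec[alphabet.index(char)] = 1.0
    return vec


def matvec(vec, matrix):
    """Multiply a row vector (length N) by an N x D matrix given as nested lists."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(n_cols)]


toy_alphabet = ["a", "b", "c"]   # hypothetical 3-character alphabet
toy_weights = [[0.1, 0.2],       # made-up 3 x 2 weight matrix (one row per character)
               [0.3, 0.4],
               [0.5, 0.6]]

encoded = one_hot("b", toy_alphabet)
projected = matvec(encoded, toy_weights)

print(encoded)     # single 1.0 at the index of "b"
print(projected)   # equals the row of toy_weights that belongs to "b"
```

Because the one-hot product only ever picks out a row, a layer that looks the row up directly from the character index gives the same result without the multiplication.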

We use the below class to store the alphabet and generate mappings to encode from `char -> onehot` and `onehot -> char`.

```haskell
OneHotEmbedding.Init :: alphabet -> None
OneHotEmbedding.N :: None -> Int
OneHotEmbedding.encode :: Char -> onehot vector
OneHotEmbedding.label :: Char -> Int
OneHotEmbedding.inverse_label :: Int -> Char

```

In [3]:
# Helper class to build one-hot embeddings for a given alphabet.
# The alphabet is the set of unique characters in your language.
class OneHotEmbedding:
    def __init__(self, alphabet):
        self.alphabet = alphabet
        self.inverse_map = dict(enumerate(alphabet))
        self.map = dict(zip(alphabet, range(len(alphabet))))

    def N(self):
        return len(self.alphabet)

    def encode(self, x):
        # Returns a vector of size N with a single 1 at the index of character x
        v = torch.FloatTensor(self.N()).zero_()
        v[self.map[x]] = 1
        return v
    
    def label(self, x):
        return torch.LongTensor([self.map[x]])

    def inverse_label(self, x):
        return self.inverse_map[x]

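    def decode(self, y):
        # y: T x B x N scores; pick the most probable label at each position
        _, max_probs = torch.max(y.transpose(0, 1), 2)
        max_probs = max_probs.squeeze()
        # Map each label index back to its character (via inverse_label)
        return ''.join(self.inverse_label(int(i)) for i in max_probs.view(-1))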


### Framing Language Modelling as Sequence-to-Sequence Learning ###
![char-rnn seq2seq](charrnn.png)


- The input and target sequences are sequences of characters, shifted by one position relative to each other.
- For example, if your corpus is "cvit summer school" and your chunk_len=4,
    - then the first chunk = "cvit".
    - The input sequence will be "cvi" and
    - the target is "vit".
- We then convert each character in the input and target sequences to a one-hot vector. Each one-hot vector has size = your alphabet size. (In the training code below, the target is passed to the loss as a class index rather than as a one-hot vector.)
- From the input and target sequences, we take one input/target pair at a time and feed it to the network.
    - At each time step we calculate the loss.
    - Once all the time steps are processed, the sum of the losses is calculated.
    - Now we backpropagate the error.
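
The chunking described in the list above can be sketched in plain Python. This is only an illustration of the "cvit" example, with a hypothetical helper name `make_pairs`; the training loop further down slices its chunks slightly differently (it takes `chunk_size + 1` characters per chunk).

```python
def make_pairs(corpus, chunk_len):
    """Split `corpus` into non-overlapping chunks of `chunk_len` characters and
    return (input, target) string pairs, where target is input shifted by one."""
    pairs = []
    for start in range(0, len(corpus) - chunk_len + 1, chunk_len):
        chunk = corpus[start:start + chunk_len]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


pairs = make_pairs("cvit summer school", 4)
print(pairs[0])   # ('cvi', 'vit')

# Each (input char, target char) pair is what the network sees at one time step.
for x, z in zip(*pairs[0]):
    print(repr(x), "->", repr(z))
```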
    

```haskell
Network.forward :: x(t), h(t-1) -> y(t), h(t)
```

In order to make the role of the hidden state easier to see and to manipulate, we build the module so that the hidden state is passed in and returned explicitly.

We're using a `GRU`; you can substitute it with an `RNN` or an `LSTM`, with the required parameters. For an `LSTM`, you'll additionally have to carry the cell state through the forward pass.

In [4]:
# Model Def
class Network(nn.Module):
    def __init__(self, **kw):
        super(Network, self).__init__()
        self.input_size = kw['input_size']
        self.hidden_size = kw['hidden_size']
        self.output_size = kw['output_size']
        self.n_layers = kw['n_layers']

        self.fc_in = nn.Linear(self.input_size, self.hidden_size)
        self.rnn = nn.GRU(self.hidden_size, self.hidden_size, self.n_layers)
        self.fc_out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, x, h):
        # One hot vector of single column coming in. 
        # View sorcery is to adjust to the layer's dimension requirement
        # Size(D) -> Size(1,D)

        x = self.fc_in(x.view(1, -1))

        # Mimicking TxBxD, required by RNN.
        # h(t-1) in, h(t) out.
        x, h = self.rnn(x.view(1, 1, -1), h)

        x = self.fc_out(x.view(1, -1))
        return x, h

    def init_hidden(self):
        return Variable(torch.zeros(self.n_layers, 1, self.hidden_size))


## Prepare for Training
We prepare the data, generate the alphabet from the data and use it to initialize the `OneHotEmbedding` and `Network`.
Almost all hyperparameters are defined here.

In [5]:
printable=string.printable
# Reads the text and lowercases everything (so that we have fewer class labels),
# then removes non-printable characters and a few symbols from the corpus.
text = open("../../../data/lab2/sh.txt").read().lower()

pruned_text = ''
for c in text:
    if c in printable and c not in '{}[]&_':
        pruned_text += c
text = pruned_text
alphabet = list(set(list(text)))

print ('size of your alphabet =', len(alphabet))
print ('your alphabet is =', alphabet)

onehot = OneHotEmbedding(alphabet)
chunk_size = 128

('size of your alphabet =', 52)
('your alphabet is =', ['\n', '!', ' ', '"', "'", ')', '(', '*', '-', ',', '/', '.', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8', ';', ':', '?', 'a', '`', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z'])


In [6]:

## 
batch_length = 64
hidden_size = 100
n_layers = 1
# Input and output sizes are both len(alphabet) = onehot.N().
net = Network(input_size=onehot.N(), hidden_size=hidden_size, output_size=onehot.N(), n_layers=n_layers)
criterion = nn.CrossEntropyLoss()
learning_rate = 5e-3
optimizer = optim.Adam(net.parameters(), learning_rate)

if use_cuda:
    net=net.cuda()
    criterion=criterion.cuda()
epoch = 0


## Generating Text

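The following function generates new text: starting from a seed (`prime`), it repeatedly samples the next character from the network's output distribution, scaled by a `temperature`, and feeds that character back in as the next input.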

In [7]:
def generate(**kw):
    result = kw['prime']
    net=kw['net']
    onehot=kw['onehot']

    h = net.init_hidden()
    if use_cuda:
        h = h.cuda()

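    # Prime the hidden state by feeding the seed text, one character at a time.
    # After this loop, y holds the scores for the character that follows the seed.
    for char in result:
        x = onehot.encode(char)
        if use_cuda:
            x=x.cuda()
        x = Variable(x, requires_grad=False)
        y, h = net(x, h)

    for p in range(kw["length"]):
        # Sample the next character from the temperature-scaled output distribution
        y_dist = y.data.view(-1).div(kw["temperature"]).exp()
        sampled = int(torch.multinomial(y_dist, 1)[0])

        prediction = onehot.inverse_label(sampled)
        result += prediction

        # Feed the sampled character back in to get the scores for the next step
        x = onehot.encode(prediction)
        if use_cuda:
            x=x.cuda()
        x = Variable(x, requires_grad=False)
        y, h = net(x, h)
    return result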
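
The effect of the `temperature` argument can be illustrated without PyTorch. The sketch below is a plain-Python approximation of the "divide by temperature, exponentiate, sample" step in `generate`; the helper names `temperature_probs` and `sample_index` and the scores are made up for illustration, and the maximum is subtracted before exponentiating purely for numerical stability.

```python
import math
import random


def temperature_probs(scores, temperature):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(scores, temperature, rng=random):
    """Sample one index with probability proportional to exp(score / temperature)."""
    probs = temperature_probs(scores, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_scores = [1.0, 2.0, 3.0]                # made-up network outputs for 3 characters

cold = temperature_probs(toy_scores, 0.01)  # low temperature: mass piles on the top score
hot = temperature_probs(toy_scores, 100.0)  # high temperature: close to uniform

print("T=0.01:", cold)
print("T=100 :", hot)
print("sampled at T=0.01:", sample_index(toy_scores, 0.01, random.Random(0)))
```

Lower temperatures make the samples more conservative and repetitive, while higher temperatures make them more varied and more error-prone.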


# Training 

We train on one *chunk* at a time, where a chunk is a piece of the text of length `chunk_size`.

For example, consider the text **I've been carrying this thing since 2008.** *(Happy Hogan, Spider-Man: Homecoming, 2017)*, and a chunk length of `3`. If you unroll the training loop, it looks as follows:

```python
# <I'v>, <'ve> = x, z -------

"'", h_1 = net("I", h_0)
"v", h_2 = net("'", h_1)
"e", h_3 = net("v", h_2)

# <e b>, < be> = x, z -------

" ", h_4 = net("e", h_3)
"b", h_5 = net(" ", h_4)
"e", h_6 = net("b", h_5)

```

Notice that the hidden state is carried forward throughout the text. As training progresses, the network picks up more and more patterns in the language and becomes able to generate new text by sampling from the output distribution.
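
To see the bookkeeping in isolation, here is a small sketch that reproduces the unrolled trace above with a stand-in for the network. `toy_step` is a hypothetical placeholder for `net(x, h)`: it does no learning and simply counts steps, so the "hidden state" is just the subscript of `h`. The chunk slicing follows the same pattern as the training loop.

```python
def toy_step(char, h):
    """Stand-in for net(x, h): returns no prediction and the next state index."""
    return None, h + 1


toy_text = "I've be"     # first few characters of the example sentence
toy_chunk_size = 3

h = 0                    # h_0, initialized once before the first chunk
trace = []
for i in range(0, len(toy_text) - toy_chunk_size, toy_chunk_size):
    chunk = toy_text[i:i + toy_chunk_size + 1]
    xs, zs = chunk[:-1], chunk[1:]
    for x, z in zip(xs, zs):
        _, h = toy_step(x, h)        # h is NOT reset between chunks
        trace.append((x, z, h))

for x, z, step in trace:
    print("input %r, target %r, state h_%d" % (x, z, step))
```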

In [8]:
for j in range(100):
        # Hidden state is initialized only at the start of each pass over the text.
        # It is then carried forward throughout the text.
        h = net.init_hidden()
        if use_cuda:
            h = h.cuda()
        for k, i in enumerate(range(0, len(text)-chunk_size, chunk_size)):
            chunk = text[i:i+chunk_size+1]
            xs, zs = chunk[:-1], chunk[1:]

            loss = 0

            net.zero_grad()
            # Iterate through each character -> next character mapping
            # Carrying hidden state forward.
            for x, z in zip(xs, zs):
                x = onehot.encode(x)
                z = onehot.label(z)
                if use_cuda:
                    x = x.cuda()
                    z = z.cuda()
                x = Variable(x, requires_grad=False)
                z = Variable(z)
                y, h = net(x, h)
                loss += criterion(y.view(1, -1), z)
            


            # Detach h from this chunk's graph: its value is carried into the next chunk,
            # but backpropagation does not reach back into earlier chunks.
            h = h.data
            if use_cuda:
                h = h.cuda()
            h = Variable(h, requires_grad=True)

            loss.backward()
            optimizer.step()

            #print("Loss: ", loss.data[0]/len(xs))
            if k % 50 == 0:
                new = generate(prime='elementary my dear watson'.lower(), net=net, onehot=onehot, temperature=0.8, length=100)
                print("----- Generated %d: --------------\n"%(k), new)
            if k%5000 ==0:
                kstring=str(k)
                jstring=str(j)
                #torch.save(net, 'char_rnn_stateful_onehot_'+jstring+'_'+kstring+'.pt')


('----- Generated 0: --------------\n', 'elementary my dear watsonedfv5;b(98.6m3ta.a.p;dim4, zoae\nzdp:0,5v9"f4kzn;98\n6)m??`"*0*) 7hxhz (hrix"\n :tnev(tkt5ydx/iv6.d):bi')
('----- Generated 50: --------------\n', 'elementary my dear watsonthesan  weredrr sycen  i       d ilt  asl thonoa.comennse otr         aidthed                wa isoa')


KeyboardInterrupt: 

In [None]:
## loading a pretrained model #
use_cuda=True

net=torch.load('../../../data/lab2/char_rnn_stateful_onehot_0_25000.pt')


#### DO NOT LOAD THE MODEL MULTIPLE TIMES, IT WILL THROW A CUDA DEVICE ERROR
#IF YOU ENCOUNTER SUCH AN ERROR YOU MUST RESTART THE KERNEL ##

In [None]:
## now test on this model ####
generatedText= generate(prime='elementary my dear watson'.lower(), temperature=0.1, length=200, net=net, onehot=onehot)
print (generatedText)

### Exercise 1 ###
1. What is the dimension (`tensor.size()`) of the variable `h` in the above code snippet? Why is it fed to the network (<i>net(x, h)</i>) along with the input (`x` in our case)? [Hint: notice that the network also returns an `h` every time.]
2. For how long is the hidden state carried forward during training?
    - A. It is carried forward from one time step to the next within a sequence, but not from the last time step of a sequence to the first time step of the next sequence.
    - B. It is carried forward not just across time steps within a sequence, but also from one sequence to the next.
    - C. It is carried forward all throughout the training.
3. For what value of the temperature T is the sampling equivalent to an argmax (i.e. picking the most probable label)?
4. Vary the value of T and see how the generated text varies.
5. Using the saved model, find the most probable character the network would predict if your input seed is "holme".

### Exercise 2 ###

1. In the above code, the learning is modelled as a seq2seq problem: your input is a sequence of characters and the target is another sequence of characters, which means you have a target at each time step of the sequence. But text generation can also be modelled as a sequence-to-one problem, where the input is a sequence and the target is just the next character after it. Can you modify the code to do this? (Remember that since it is sequence-to-one, the output of the hidden layer needs to be fed to the output layer only at the last time step.)
2. We were using non-overlapping chunks in the above case. For example, if the first chunk was 1234, then the next one was 5678, and so on. How would the results differ if we took overlapping chunks, so that 1234 is the first chunk and 2345 is the next?
3. Can you also modify the code so that the hidden state is not retained from one chunk to the next?
4. Try using an MSE loss for the above problem. How does the network converge with an MSE loss? Why did MSE perform worse or better?
