In this section we use an RNN for a very simple task: a character-level language model. This is a many-to-many example of an RNN.

**Workflow**:
  - feed a sequence of characters into the RNN
  - at every time step, ask the RNN to predict the next character.

Let's say we have a training sequence of just one string, "hello", and a vocabulary $V = \{\text{"h"}, \text{"e"}, \text{"l"}, \text{"o"}\}$.

### Work procedure:
We know the basics of the RNN and its equation from the previous section. Now we make a one-hot vector for each letter in the vocabulary, as follows.

![alt text](images/5.PNG)
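The encoding can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the vocabulary order "h", "e", "l", "o" used above; `one_hot` and `char_to_index` are hypothetical names introduced only for this example.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]  # assumed ordering, matching the vocabulary above
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a column vector with a 1 at the index of `ch` (hypothetical helper)."""
    vec = np.zeros((len(vocab), 1))
    vec[char_to_index[ch]] = 1.0
    return vec

# Inputs for the training string "hello": every character except the last,
# because the last character is only ever a prediction target.
inputs = [one_hot(ch) for ch in "hell"]
print(inputs[0].ravel())  # encoding of "h"
```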

Then we feed the RNN a single vector at each time step. Initially, $h_0$ is a zero vector, of size 3 in this case. Applying the RNN function with the same weight matrices at every time step, we get the following:

$\begin{aligned}
\begin{bmatrix}0.3 \\ -0.1 \\ 0.9 \end{bmatrix} &= f_W(W_{hh}\begin{bmatrix}0 \\ 0 \\ 0 \end{bmatrix} + W_{xh}\begin{bmatrix}1 \\ 0 \\ 0 \\ 0 \end{bmatrix}) \ \ \ \ &(1) \\
\begin{bmatrix}1.0 \\ 0.3 \\ 0.1 \end{bmatrix} &= f_W(W_{hh}\begin{bmatrix}0.3 \\ -0.1 \\ 0.9 \end{bmatrix} + W_{xh}\begin{bmatrix}0 \\ 1 \\ 0 \\ 0 \end{bmatrix}) \ \ \ \ &(2) \\
\begin{bmatrix}0.1 \\ -0.5 \\ -0.3 \end{bmatrix} &= f_W(W_{hh}\begin{bmatrix}1.0 \\ 0.3 \\ 0.1 \end{bmatrix} + W_{xh}\begin{bmatrix}0 \\ 0 \\ 1 \\ 0 \end{bmatrix}) \ \ \ \ &(3) \\
\begin{bmatrix}-0.3 \\ 0.9 \\ 0.7 \end{bmatrix} &= f_W(W_{hh}\begin{bmatrix}0.1 \\ -0.5 \\ -0.3 \end{bmatrix} + W_{xh}\begin{bmatrix}0 \\ 0 \\ 1 \\ 0 \end{bmatrix}) \ \ \ \ &(4)
\end{aligned}$

![alt text](images/cg6.PNG)
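The recurrence in equations (1) to (4) can be sketched as a loop. This is only an illustration under stated assumptions: the weights are random (so the printed numbers will not match the figure), and $f_W$ is taken to be tanh, as in the gradient-flow discussion later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is repeatable
hidden_size, vocab_size = 3, 4

# Randomly initialised weights (illustrative; not the values behind the figure)
W_xh = rng.normal(scale=0.5, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))

h = np.zeros((hidden_size, 1))      # h_0 is the zero vector
hidden_states = []
for idx in [0, 1, 2, 2]:            # indices of "h", "e", "l", "l"
    x = np.zeros((vocab_size, 1))
    x[idx] = 1.0                    # one-hot input for this time step
    h = np.tanh(W_hh @ h + W_xh @ x)  # the same weights are reused at every step
    hidden_states.append(h)

print(np.hstack(hidden_states).round(2))  # one column per time step
```

Because $h_0$ is zero, the first hidden state depends only on the column of `W_xh` selected by the one-hot input for "h".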



**Embedding layer**: A layer that converts categorical data, such as words or characters, into dense vectors of real numbers. This process of representing categorical data as continuous vectors is known as embedding. To convert a one-hot vector into an embedding, we multiply an embedding matrix by the one-hot vector, which picks out the entries of the matrix that belong to that category.
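A small sketch of why this multiplication amounts to a lookup. The embedding matrix `E` and the dimension `embed_dim` are hypothetical values chosen for illustration; with the matrix stored as (embedding size × vocabulary size), the one-hot product selects a column.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 4, 3          # embed_dim is an arbitrary choice here

E = rng.normal(size=(embed_dim, vocab_size))  # hypothetical embedding matrix

idx = 2                               # index of "l" in the vocabulary
x = np.zeros((vocab_size, 1))
x[idx] = 1.0

via_matmul = E @ x                    # embedding through matrix multiplication
via_lookup = E[:, [idx]]              # the same result by selecting column idx

print(np.allclose(via_matmul, via_lookup))
```

In the same way, the product $W_{xh}x_t$ with a one-hot $x_t$ selects a single column of $W_{xh}$.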

We can also predict the next character at each time step. Since there are 4 characters in the vocabulary, the predicted vector at each time step is 4-dimensional.

![alt text](images/6.PNG)

At first we feed "h" into the network. With the current weights, the RNN predicts the next character at this time step as,
$\begin{bmatrix}1.0 \\ 2.2 \\ -3.0 \\ 4.1 \end{bmatrix} \rightarrow \begin{bmatrix}\text{"h"} \\ \text{"e"} \\ \text{"l"}\\ \text{"o"} \end{bmatrix}$


Here, each score corresponds to a character in the vocabulary: 1.0 for "h", 2.2 for "e", and so on. The prediction at each time step is the character with the highest score. In this case the highest score is 4.1, which corresponds to "o". So at this time step the network does not predict the expected "e"; the prediction is wrong and the loss is high.
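To make "high loss" concrete, the scores above can be turned into probabilities and a loss value. This sketch assumes a softmax cross-entropy loss, which is what the code later in this section uses.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
scores = np.array([1.0, 2.2, -3.0, 4.1])  # scores from the example above

# Softmax; subtracting the max keeps the exponentials numerically stable
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

predicted = vocab[int(np.argmax(probs))]    # character with the highest score
target = "e"                                # correct next character after "h"
loss = -np.log(probs[vocab.index(target)])  # cross-entropy for this time step

print(predicted, probs.round(3), round(float(loss), 3))
```

The probability assigned to "e" is only about 0.13, so the loss for this step is roughly 2.08, while a confident correct prediction would give a loss close to 0.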

At test time, we feed one character (a prefix) to the RNN and draw a sample for that time step from the predicted distribution. We then feed this sample as the input to the next time step. Sampling, rather than always taking the highest-scoring character, is a benefit because it lets the model produce diverse outputs.

**Q**: Why do we use a one-hot vector as input instead of the softmax vector at test time?

**Ans**: Because the model is trained with one-hot vectors, so if it sees anything else it may fail to give correct output. Moreover, in practice the vocabulary may be very large. For example, if we generate words instead of characters, the vocabulary could contain every word in the English dictionary, and feeding a dense vector of that size at every step is computationally expensive.

![alt text](images/7.PNG)

### Backpropagation and Gradient calculation 
In an RNN, forward propagation runs forward through time and backpropagation runs backward through time. If the training sequence is very long, such as a whole Wikipedia article, running the forward pass over every word and then computing gradients over every word before making a single update is not practical: each update takes very long, so the model converges extremely slowly, if at all. Therefore we compute the loss only over chunks of a fixed number of steps, say 100. The forward pass still carries the hidden state from one chunk to the next, but the backward pass and gradient calculation cover only the current chunk, which makes them much cheaper. We call this method **truncated backpropagation through time**.

![alt text](images/9.PNG) ![alt text](images/8.PNG)
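The chunking itself can be sketched without any model. `truncated_chunks` is a hypothetical helper written for this illustration, and the chunk length of 4 is chosen only to keep the output short.

```python
def truncated_chunks(sequence, chunk_len):
    """Yield (inputs, targets) pairs of at most chunk_len steps (hypothetical helper)."""
    for start in range(0, len(sequence) - 1, chunk_len):
        inputs = sequence[start:start + chunk_len]
        targets = sequence[start + 1:start + chunk_len + 1]
        # The last chunk may have fewer targets than inputs, so trim the inputs
        yield inputs[:len(targets)], targets

text = "hello world"
for inputs, targets in truncated_chunks(text, chunk_len=4):
    # In training, one forward/backward pass would run on this chunk only,
    # and the final hidden state would be carried into the next chunk.
    print(repr(inputs), "->", repr(targets))
```

Each target string is the input string shifted by one character, so every time step has a "next character" to predict.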

### Gradient flow for vanilla RNN

We already know that the RNN takes the previous hidden state $h_{t-1}$ and the input $x_t$ at time step $t$ and passes the result through tanh: $h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t)$.
For backpropagation we need to find out how the loss at the last time step affects the weights $W_{hh}$ all the way back through the first time step.

**Calculating gradients**:
The partial derivative of $h_t$ w.r.t. $h_{t-1}$ is $\frac{\partial h_t}{\partial h_{t-1}} = \tanh^{'}(W_{hh}h_{t-1} + W_{xh}x_t)W_{hh}$

We update $W_{hh}$ using the derivative of the loss at the last time step $T$ w.r.t. $W_{hh}$.

$\begin{aligned}
\frac{\partial L_{T}}{\partial W_{hh}} &= \frac{\partial L_{T}}{\partial h_{T}} \frac{\partial h_{T}}{\partial h_{T-1}} \dots \frac{\partial h_{1}}{\partial W_{hh}} \\
&= \frac{\partial L_{T}}{\partial h_{T}}\left(\prod_{t=2}^{T} \frac{\partial h_{t}}{\partial h_{t-1}}\right)\frac{\partial h_{1}}{\partial W_{hh}} \\
&= \frac{\partial L_{T}}{\partial h_{T}}\left(\prod_{t=2}^{T} \tanh^{'}(W_{hh}h_{t-1} + W_{xh}x_t)\right)W_{hh}^{T-1}\frac{\partial h_{1}}{\partial W_{hh}}
\end{aligned}$
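One consequence of the repeated $W_{hh}$ factor is that the gradient can shrink or grow quickly as it travels back through many time steps. The sketch below is a rough numerical illustration, not part of the original code: it uses random matrices with two arbitrary scales and ignores the $\tanh^{'}$ factor, which is at most 1 and so can only shrink the gradient further.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 10, 30

def backprop_norms(scale):
    """Norms of a gradient vector after repeated multiplication by W_hh^T."""
    W_hh = rng.normal(size=(hidden_size, hidden_size)) * scale
    grad = np.ones((hidden_size, 1))
    norms = []
    for _ in range(steps):
        grad = W_hh.T @ grad          # one step of backpropagation through time
        norms.append(float(np.linalg.norm(grad)))
    return norms

small = backprop_norms(scale=0.05)    # small weights: the norm collapses
large = backprop_norms(scale=1.0)     # large weights: the norm blows up
print(small[-1], large[-1])
```

The second case is the exploding-gradient behaviour that the gradient clipping step in the code below is meant to mitigate.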

Now it's time to apply all this theory in code. Here I use the code written by **Karpathy** and analyse it.

Code link: https://gist.github.com/karpathy/d4dee566867f8291f086

In [16]:
"""
Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy)
BSD License
"""


In [17]:
import numpy as np
# data I/O
with open('data/Venus_and_Adonis.txt', 'r') as f: # should be simple plain text file
  data = f.read()
chars = list(set(data)) # unique characters in the data
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))

data has 1136 characters, 43 unique.


In [None]:
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

These dictionaries map characters to their integer indices and vice versa. Such mappings are particularly useful when working with text data.
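A short round trip shows how the two dictionaries are used. This sketch uses the toy string "hello" instead of the training file, and sorts the characters so that the indices are the same on every run (the order of `list(set(data))` can differ between Python sessions).

```python
text = "hello"                      # toy stand-in for the training data
chars = sorted(set(text))           # sorted so the indices are reproducible
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

encoded = [char_to_ix[ch] for ch in text]             # characters -> integers
decoded = "".join(ix_to_char[ix] for ix in encoded)   # integers -> characters
print(encoded, decoded)
```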

In [18]:
# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1

In [19]:
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

In [20]:
def lossFun(inputs, targets, hprev):
  """
  inputs,targets are both list of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
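The backward pass starts from `dy = ps[t]` with 1 subtracted at the target index, which is the gradient of the softmax cross-entropy loss with respect to the scores. The sketch below checks that identity numerically on made-up scores; it is a standalone check, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(5, 1))   # arbitrary unnormalised scores
target = 2                    # arbitrary target index

def loss_fn(scores):
    """Softmax cross-entropy loss for the fixed target index."""
    p = np.exp(scores) / np.sum(np.exp(scores))
    return -np.log(p[target, 0])

# Analytic gradient, as in the backward pass: probabilities minus one-hot target
p = np.exp(y) / np.sum(np.exp(y))
dy = np.copy(p)
dy[target] -= 1

# Central-difference numerical gradient for comparison
eps = 1e-5
num = np.zeros_like(y)
for i in range(y.shape[0]):
    step = np.zeros_like(y)
    step[i] = eps
    num[i] = (loss_fn(y + step) - loss_fn(y - step)) / (2 * eps)

print(np.max(np.abs(dy - num)))  # should be close to zero
```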

In [21]:

def sample(h, seed_ix, n):
  """ 
  sample a sequence of integers from the model 
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in range(n):
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  return ixes

- This function generates a sequence of integers from a trained RNN language model. It takes an initial memory state "h", a seed character index "seed_ix" to start generation, and the desired sequence length "n".
- x: holds the one-hot encoding of the current character.
- ixes: stores the generated character indices.
- ix = np.random.choice(range(vocab_size), p=p.ravel()): Samples the next character index (ix) from the probability distribution p. np.random.choice selects an index from range(vocab_size) with the probabilities given by p.
- x[ix] = 1: After x is reset to zeros, sets the element at index ix to 1, so that the sampled character becomes the input for the next time step.
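The difference between sampling and always taking the most likely character can be seen on a made-up distribution. The probabilities below are invented for illustration, and `size=10000` draws many samples in one call instead of one per time step.

```python
import numpy as np

np.random.seed(0)                            # fixed seed for a repeatable sketch
p = np.array([[0.1], [0.2], [0.3], [0.4]])   # made-up next-character probabilities
vocab_size = p.shape[0]

# Draw many indices at once from the distribution p
draws = np.random.choice(vocab_size, size=10000, p=p.ravel())
freq = np.bincount(draws, minlength=vocab_size) / len(draws)

print(freq)               # empirical frequencies, close to p
print(int(np.argmax(p)))  # a greedy choice would always pick this index
```

Sampling visits every character in proportion to its probability, whereas a greedy choice would produce the same continuation every time.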


In [22]:
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
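The memory variables above hold a running sum of squared gradients, which Adagrad uses to scale the step for each parameter individually. The sketch below applies the same update rule to a toy loss, `sum(param**2)`; the starting values and the number of steps are arbitrary choices for illustration.

```python
import numpy as np

learning_rate = 1e-1
param = np.array([5.0, -3.0])   # toy parameter vector (hypothetical)
mem = np.zeros_like(param)      # running sum of squared gradients

losses = []
for _ in range(500):
    losses.append(float(np.sum(param ** 2)))
    dparam = 2 * param          # gradient of the toy loss sum(param**2)
    mem += dparam * dparam
    # Per-element step: the larger the accumulated gradients, the smaller the step
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8)

print(param, losses[0], losses[-1])
```

Because the accumulated sum only grows, the effective step size keeps shrinking, so the parameters move steadily toward the minimum without overshooting in this toy case.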

In [24]:
while True:
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  if p+seq_length+1 >= len(data) or n == 0: 
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

  # sample from the model now and then
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200)
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print('----\n %s \n----' % (txt, ))

  # forward seq_length characters through the net and fetch gradient
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print('iter %d, loss: %f' % (n, smooth_loss)) # print progress
  
  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by], 
                                [dWxh, dWhh, dWhy, dbh, dby], 
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

  p += seq_length # move data pointer
  n += 1 # iteration counter 
  if n==500000:
    break

----
 ad toâ€kre withen,

 And yreind nasith pand thye hove toy lold-fiugwith sealive thys t steith that hee a
  
Heir aind sekdisunty:
 Move thaf and hourâ€™ss;

â€™d see c.ost seingyn  hithek,

HHe-chat r 
----
iter 5400, loss: 34.792482
----
  sich pr
 And pan     

 

â€™urtyâfe, wided the wisn then and don shid  d

 


Anndece then too hindd belnand se tyy  

Mome pang nis thouv

 Veusufeedyme, ton he thot lim,d wot himis tho wemedâ€™d d 
----
iter 5500, loss: 34.356470
----
 e;â€˜€˜€˜bucses sute wim hus  haseesw;

â€˜Thed dthen wit shte sham refâ€™d boufawe the baber, florhâ€™d on shen,  on Sfpir thith chet: Ao kink-bog;

 Mons thufâ€™d dlin 

Siprir.

Bure a loir cosun,
 
----
iter 5600, loss: 34.033631
----
  d

S-€™dde-Anonn logh,
â€™dlyie, o bogi ,
 bipsher;
 
â€˜wished longhet roth scorer parst stelfre fiela
 bathâ€™d;e womre love le;

 bier thistpy thasithy mafâ€™ldy wisued woo thos Vaothas dacan sake 
----
iter 5700, loss: 33.622358
----
  a
 bucisto ao sVelyy,

 €˜Andlp-n