# Recurrent Neural Networks Implementation
Why is this important?
- The Transformer's attention mechanism sits inside an encoder-decoder architecture, a design that earlier sequence models built from RNNs

Resources:
- https://towardsdatascience.com/recurrent-neural-networks-rnns-3f06d7653a85

## 1/ RNN About:
- A neural network specialized for processing sequences of data
- In NLP, to predict the next word in a sentence, it is important to know the words that came before it.
- Recurrent = the network performs the same task for every element of a sequence, with each output dependent on the previous computations. We can think of the RNN as having a memory that captures information about what has been calculated so far.
- The gradient at each output depends not only on the calculations of the current time step but also on those of the previous time steps

## 2/ Implementation:
- Build a text generation model with an RNN.
- Train the model to predict the probability of a character given the preceding characters
- Steps:
    - 1. Initialize weight matrices U, V, W from a random distribution and biases b, c with zeros
    - 2. Forward propagation to compute prediction
    - 3. Compute the loss
    - 4. Back-propagation to compute gradients
    - 5. Update weights based on gradients
    - 6. Repeat 2-5

### Step 1: Initialization
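A minimal sketch of the initialization step, assuming the same uniform range [-1/sqrt(n), 1/sqrt(n)] that the `RNN` class further down uses. The helper name `init_params` and its `seed` argument are illustrative and not part of the original notebook.

```python
import numpy as np

def init_params(hidden_size, vocab_size, seed=0):
    """Return U, V, W drawn uniformly and zero biases b, c (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded generator, an assumption for reproducibility

    def uniform(n_in, shape):
        # n_in = number of incoming connections from the previous layer
        bound = np.sqrt(1.0 / n_in)
        return rng.uniform(-bound, bound, shape)

    U = uniform(vocab_size, (hidden_size, vocab_size))    # input -> hidden
    W = uniform(hidden_size, (hidden_size, hidden_size))  # hidden -> hidden
    V = uniform(hidden_size, (vocab_size, hidden_size))   # hidden -> output
    b = np.zeros((hidden_size, 1))                        # hidden bias
    c = np.zeros((vocab_size, 1))                         # output bias
    return U, V, W, b, c

# Toy sizes chosen only for illustration
U, V, W, b, c = init_params(hidden_size=4, vocab_size=3)
print(U.shape, V.shape, W.shape, b.shape, c.shape)
```

Scaling the range by the number of incoming connections keeps the initial pre-activations small, so `tanh` starts out away from its saturated regions.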

### Step 2: Forward pass
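The forward pass uses the following equations, where x(t) is the one-hot input, h(t) the hidden state, o(t) the unnormalized scores, and y(t) the predicted probabilities at time step t:
- a(t) = b + W*h(t-1) + U*x(t)
- h(t) = tanh(a(t))
- o(t) = c + V*h(t)
- y(t) = softmax(o(t))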
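
The four equations can be sketched as a single time-step function. This is an illustrative version, not the notebook's own code: the name `rnn_step`, the toy sizes, and the random weights are assumptions.

```python
import numpy as np

def rnn_step(x, h_prev, U, W, V, b, c):
    """One forward time step following the four equations."""
    a = b + W @ h_prev + U @ x   # a(t) = b + W*h(t-1) + U*x(t)
    h = np.tanh(a)               # h(t) = tanh(a(t))
    o = c + V @ h                # o(t) = c + V*h(t)
    e = np.exp(o - np.max(o))    # shift by the max for numerical stability
    y = e / np.sum(e)            # y(t) = softmax(o(t))
    return h, y

# Toy sizes and weights chosen only for illustration
hidden_size, vocab_size = 4, 3
rng = np.random.default_rng(0)
U = rng.uniform(-0.5, 0.5, (hidden_size, vocab_size))
W = rng.uniform(-0.5, 0.5, (hidden_size, hidden_size))
V = rng.uniform(-0.5, 0.5, (vocab_size, hidden_size))
b = np.zeros((hidden_size, 1))
c = np.zeros((vocab_size, 1))

h = np.zeros((hidden_size, 1))   # initial hidden state
for ix in [0, 2, 1]:             # a short sequence of character indices
    x = np.zeros((vocab_size, 1))
    x[ix] = 1                    # one-hot encoding
    h, y = rnn_step(x, h, U, W, V, b, c)
```

The hidden state `h` is carried from one iteration to the next, which is what gives the network its memory of earlier characters.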

### Step 3: Compute Softmax and Numerical Stability
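- The softmax function takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range [0, 1] that add up to 1
- For numerical stability, the implementation below subtracts the maximum entry before exponentiating, which leaves the result unchanged but avoids overflow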
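
A small sketch of why the maximum is subtracted, written as a standalone function (the notebook defines softmax as a method on the class). Large scores such as 1000 would overflow in a naive `np.exp`, while the shifted version returns the same probabilities as for the scores 1, 2, 3.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array or column vector."""
    x = np.asarray(x, dtype=float)
    p = np.exp(x - np.max(x))  # the shift cancels out in the ratio below
    return p / np.sum(p)

small = softmax([1, 2, 3])           # approximately [0.090, 0.245, 0.665]
big = softmax([1000, 1001, 1002])    # same result, no overflow
print(small, big)
```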

### Step 4: Compute Loss
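The loss used in the class further down is the cross-entropy summed over time steps: for each step, take the negative log of the probability assigned to the correct next character. A minimal standalone sketch, where the function name `cross_entropy` and the toy probabilities are assumptions:

```python
import numpy as np

def cross_entropy(probs, targets):
    """Sum of -log p(target) over time steps.

    probs:   list of (vocab_size, 1) probability vectors, one per time step
    targets: list of target indices, one per time step
    """
    return sum(-np.log(probs[t][targets[t], 0]) for t in range(len(targets)))

# Two time steps over a 3-symbol vocabulary, values made up for illustration
probs = [np.array([[0.5], [0.25], [0.25]]),
         np.array([[0.1], [0.8], [0.1]])]
targets = [0, 1]
loss = cross_entropy(probs, targets)  # equals -log(0.5) - log(0.8)
```

A confident correct prediction contributes a value near 0, while a low probability on the correct character contributes a large value.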

### Step 5: Backward Pass
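The backward pass starts from the gradient of the loss with respect to the scores o(t). For softmax followed by cross-entropy this gradient is y(t) minus the one-hot target, which is what the line `dy[targets[t]] -= 1` in the class further down relies on. The sketch below checks that identity numerically; the scores, the target index, and the helper names are assumptions made for illustration.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / np.sum(e)

def loss_from_scores(o, target):
    # cross-entropy of softmax(o) against a single target index
    return -np.log(softmax(o)[target, 0])

o = np.array([[0.2], [-1.0], [0.5]])  # arbitrary scores for a 3-symbol vocabulary
target = 2

# Analytic gradient: y - one_hot(target)
analytic = softmax(o)
analytic[target] -= 1

# Numerical gradient by central differences
eps = 1e-5
numeric = np.zeros_like(o)
for i in range(o.shape[0]):
    step = np.zeros_like(o)
    step[i] = eps
    numeric[i] = (loss_from_scores(o + step, target)
                  - loss_from_scores(o - step, target)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)  # expected to be tiny if the analytic form is right
```

The same finite-difference approach can be applied to U, W and V to check a full backward pass, at the cost of one loss evaluation pair per parameter.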

### Step 6: Update Weights
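A minimal sketch of a plain SGD update with gradient clipping. The class further down clips inside `backward` and updates in `update_model`; combining both in one hypothetical `sgd_update` helper here is a simplification for illustration. The clip threshold of 5 is the one the class uses.

```python
import numpy as np

def sgd_update(params, grads, lr, clip=5.0):
    """Clip each gradient to [-clip, clip], then take one SGD step in place."""
    for param, grad in zip(params, grads):
        np.clip(grad, -clip, clip, out=grad)  # limits exploding gradients
        param -= lr * grad                    # in place, so the caller's arrays change

W = np.array([[1.0, -2.0]])
dW = np.array([[10.0, 0.5]])  # the first entry exceeds the clip threshold
sgd_update([W], [dW], lr=0.1)
print(W)  # W is now [[0.5, -2.05]]
```

Updating in place matters: rebinding `param = param - lr * grad` inside the loop would create a new array and leave the model's weights unchanged.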

In [15]:
import numpy as np

# Vocab size can be the number of unique chars from a char based model or number of unique words from a word based model 
class RNN:
    def __init__(self, hidden_size, vocab_size, seq_length, lr):
        # hyper parameters:
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.seq_length = seq_length
        self.lr = lr
        
        # model parameters - random initialization:
        # it is recommended to draw the initial weights uniformly from [-1/sqrt(n), 1/sqrt(n)], with n = the number of incoming connections from the previous layer
        # np.random.uniform(low, high, size), where size is the output shape
        self.U = np.random.uniform(-np.sqrt(1./vocab_size), np.sqrt(1./vocab_size), (hidden_size, vocab_size))
        self.V = np.random.uniform(-np.sqrt(1./hidden_size), np.sqrt(1./hidden_size), (vocab_size, hidden_size))
        self.W = np.random.uniform(-np.sqrt(1./hidden_size), np.sqrt(1./hidden_size), (hidden_size, hidden_size))
        self.b = np.zeros((hidden_size, 1)) # bias for hidden layer
        self.c = np.zeros((vocab_size, 1)) # bias for output
    
    def forward(self, inputs, hprev):
        xs, hs, os, ycap = {}, {}, {}, {}
        hs[-1] = np.copy(hprev)
        for t in range(len(inputs)):
            xs[t] = np.zeros((self.vocab_size, 1))
            xs[t][inputs[t]] = 1 # one hot encoding
            hs[t] = np.tanh(np.dot(self.U, xs[t]) + np.dot(self.W, hs[t-1]) + self.b)
            os[t] = np.dot(self.V, hs[t]) + self.c # unnormalized log probs for the next char
            ycap[t] = self.softmax(os[t]) # probs for next char
            
        return xs, hs, ycap
    
    def softmax(self, x):
        p = np.exp(x - np.max(x))
        return p/np.sum(p)
    
    def loss(self, ps, targets):
        # Calculate cross-entropy loss
        return sum(-np.log(ps[t][targets[t], 0]) for t in range(self.seq_length))
    
    def backward(self, xs, hs, ycap, targets): # ycap = prediction, targets = groundtruth
        # compute the gradients going backwards
        dU, dW, dV = np.zeros_like(self.U), np.zeros_like(self.W), np.zeros_like(self.V)
        db, dc = np.zeros_like(self.b), np.zeros_like(self.c)
        dhnext = np.zeros_like(hs[0]) # the next stage of h
        
        for t in reversed(range(self.seq_length)):
            dy = np.copy(ycap[t])
            dy[targets[t]] -= 1
            
            dV += np.dot(dy, hs[t].T)
            dc += dy
            
            # dh has 2 components: the gradient flowing from the output and from the next cell
            dh = np.dot(self.V.T, dy) + dhnext # backprop into h
            
            # dhrec is the recurring component seen in most of the calculations
            dhrec = (1-hs[t] * hs[t]) * dh # backprop thru tanh non-linearity
            db += dhrec
            
            dU += np.dot(dhrec, xs[t].T)
            dW += np.dot(dhrec, hs[t-1].T)
            
            # pass gradient from next cell for the next iteration
            dhnext = np.dot(self.W.T, dhrec)
        
        # RNNs can suffer from vanishing or exploding gradients: the product of the
        # per-step gradients can shrink towards 0 or grow exponentially, which makes
        # it hard for the model to learn.
        # Clipping the gradients mitigates the exploding case.
        for dparam in [dU,dW,dV,db,dc]:
            np.clip(dparam, -5,5,out=dparam)
        return dU,dW,dV,db,dc
    
    def update_model(self, dU,dW,dV,db,dc): # SGD
        for param, dparam in zip([self.U,self.W, self.V, self.b, self.c], [dU,dW,dV,db,dc]):
            # Change params according to gradients and learning rate
            param += -self.lr*dparam
    
    def predict(self, data_reader, start, n):
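        # initialize input vector and hidden state
        x = np.zeros((self.vocab_size, 1))
        h = np.zeros((self.hidden_size, 1))
        ixes = [data_reader.char_to_ix[ch] for ch in start]

        # feed the start string one character at a time so that x stays one-hot;
        # the last start character is left in x for the sampling loop
        for i, ix in enumerate(ixes):
            x = np.zeros((self.vocab_size, 1))
            x[ix] = 1
            if i < len(ixes) - 1:
                h = np.tanh(np.dot(self.U, x) + np.dot(self.W, h) + self.b)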
        # predict next n chars
        for t in range(n):
            h = np.tanh(np.dot(self.U, x) + np.dot(self.W, h) + self.b)
            y = np.dot(self.V, h) + self.c
            p = self.softmax(y) # reuse the numerically stable softmax
            ix = np.random.choice(range(self.vocab_size), p=p.ravel())
            x = np.zeros((self.vocab_size, 1))
            x[ix] = 1
            ixes.append(ix)
        txt = ''.join(data_reader.ix_to_char[i] for i in ixes)
        return txt

In [16]:
softmax([1,2,3])

array([0.09003057, 0.24472847, 0.66524096])

In [17]:
rnn.predict(data_reader, "year", 50)

NameError: name 'rnn' is not defined
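
The `NameError` indicates that no `rnn` object had been created when the cell ran. The `predict` method also expects a `data_reader` exposing `char_to_ix` and `ix_to_char`, which the notebook does not show. A minimal sketch of such a reader follows; the class name `DataReader`, its constructor, the `encode`/`decode` helpers, and the sample text are all assumptions made for illustration.

```python
class DataReader:
    """Hypothetical helper: builds character <-> index mappings from a string."""

    def __init__(self, text):
        chars = sorted(set(text))  # sorted for a deterministic ordering
        self.vocab_size = len(chars)
        self.char_to_ix = {ch: i for i, ch in enumerate(chars)}
        self.ix_to_char = {i: ch for i, ch in enumerate(chars)}

    def encode(self, s):
        # string -> list of indices
        return [self.char_to_ix[ch] for ch in s]

    def decode(self, ixes):
        # list of indices -> string
        return ''.join(self.ix_to_char[i] for i in ixes)

data_reader = DataReader("a year ago")  # sample text, not from the notebook
ixes = data_reader.encode("year")
print(ixes, data_reader.decode(ixes))
```

With a reader like this, an instance such as `rnn = RNN(hidden_size, data_reader.vocab_size, seq_length, lr)` would need to be created and trained before calling `rnn.predict(data_reader, "year", 50)`. The hyperparameter values are left open here because the notebook does not state them.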