# Recurrent neural networks

Importance of understanding neural networks:

Basics of machine learning, linear algebra, neural network architecture, cost functions, optimization methods, training/test sets, activation functions/what they do, softmax

What are recurrent neural networks:



### What can RNN do that ANN cannot?

Image captioning, language translation, sentiment classification, predictive typing, video classification, nlp, speech recognition, etc.

Feed-forward NNs are strong universal function approximators. In other words, you can have a very difficult classification function and the FFNN can learn to generalize it. Recurrent neural networks take this to another level: they can compute/describe an entire program. They can almost be considered Turing complete (a system in which a program can be written to solve any computational problem).

- ANNs cannot deal with sequential or temporal data (because of the fixed weight matrices and fixed input/output size)
    - For example, if a neural network is to output a caption for a video, a list of words in a specific order is required. This is a sequence, which it cannot output because an NN cannot vary its number of output nodes. However, if a fixed number of words (say one or three) in no particular order were enough to describe the video, an NN would be fine.
    - Sequential is also not possible because when you are training a network, each feed forward iteration will have to depende
- ANN lack memory (Cannot store past results)
    - 
- ANNs have a fixed architecture (you have to change the NN and re-train)
    - There is a fixed number of processing steps (because the number of hidden layers is a hyperparameter)
    - Each neuron in an RNN is almost like an entire layer in an ANN.

### The flaws with RNN and why we need to use LSTM

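In theory, RNNs seem like an awesome solution. However, when your RNN starts to become very deep (many timesteps), an issue called the "vanishing gradient" arises: the gradient shrinks as it is propagated back through the network, so the earliest timesteps barely learn.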
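To get a feel for the effect, here is a minimal sketch (not from the original notes) that pushes a gradient backwards through many timesteps of a vanilla RNN and records its norm. The function name `gradient_norms`, the sizes, the weight scale, and the seed are all assumptions picked for illustration; other settings can just as well make the gradient explode instead.

```python
import numpy as np

def gradient_norms(num_steps=50, hidden_size=16, weight_scale=0.1, seed=0):
    """Return the norm of the hidden-state gradient as it travels back in time."""
    rng = np.random.default_rng(seed)
    W_hh = rng.normal(scale=weight_scale, size=(hidden_size, hidden_size))
    W_xh = rng.normal(scale=weight_scale, size=(hidden_size, hidden_size))

    # Forward pass: store every hidden state, starting from h_0 = 0.
    h = np.zeros(hidden_size)
    states = []
    for _ in range(num_steps):
        x = rng.normal(size=hidden_size)  # random stand-in for a real input
        h = np.tanh(W_hh @ h + W_xh @ x)
        states.append(h)

    # Backward pass: start with a unit-norm gradient at the last timestep.
    grad = np.ones(hidden_size) / np.sqrt(hidden_size)
    norms = []
    for h in reversed(states):
        # Chain rule through tanh and W_hh: grad_{t-1} = W_hh^T (grad_t * (1 - h_t^2))
        grad = W_hh.T @ (grad * (1.0 - h ** 2))
        norms.append(np.linalg.norm(grad))
    return norms

norms = gradient_norms()
print(f"gradient norm after 1 step back:   {norms[0]:.3e}")
print(f"gradient norm after 50 steps back: {norms[-1]:.3e}")
```

With these small weights, every step back multiplies the gradient by a factor below 1 on average, so the signal reaching the early timesteps is tiny.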

Diagram legend: red = input, blue = hidden neuron, green = output.

### One-to-one, one-to-many, many-to-one, many-to-many

#### Many to one
Input a sentence (a sequence of words) and return whether it was positive or not, i.e. sentiment classification.
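A rough sketch of the many-to-one idea: read the whole sequence, keep only the last hidden state, and squash it into a single score. Everything here is an assumption made for illustration: the function name `many_to_one`, the sizes, the random vectors standing in for word vectors, and the untrained random weights (so the scores mean nothing yet).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def many_to_one(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    """Read a whole sequence, then emit one score from the final hidden state."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    for x in inputs:             # one word vector per timestep
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
    return sigmoid(W_hy @ h + b_y)

rng = np.random.default_rng(0)
embed_size, hidden_size = 5, 8   # hypothetical sizes
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(1, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(1)

short_sentence = rng.normal(size=(3, embed_size))   # 3 "words"
long_sentence = rng.normal(size=(12, embed_size))   # 12 "words"
p_short = many_to_one(short_sentence, W_xh, W_hh, W_hy, b_h, b_y)
p_long = many_to_one(long_sentence, W_xh, W_hh, W_hy, b_h, b_y)
print(p_short.shape, p_long.shape)  # one score each, whatever the sentence length
```

The same weights handle both sentence lengths, which is exactly what a fixed-size feed-forward network cannot do.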

#### Many to Many
The idea behind this is combining multiple ANNs together. You can combine a CNN with an RNN to create image captioning. Example: two people in a photo. The CNN will identify that there are two. "People" will come from the RNN because it is functionally dependent on the second hidden state. In other words, given the word "two", "people" should come next, based on the RNN's experience from training and the initial image we inputted. Every outputted word is dependent on the previous word (LRCN).

RNNs aren’t magic; they only work because trained networks identified and learned patterns in data during training time that they now look for during prediction.

What makes RNN so exciting is that they allow for operation over sequences of vectors.

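In [1]:
# step function in a vanilla RNN
import numpy as np

class RNN:
  def __init__(self, input_size, hidden_size, output_size):
    # small random weights and a zero initial hidden state (sizes are illustrative)
    rng = np.random.default_rng(0)
    self.W_xh = rng.normal(scale=0.01, size=(hidden_size, input_size))
    self.W_hh = rng.normal(scale=0.01, size=(hidden_size, hidden_size))
    self.W_hy = rng.normal(scale=0.01, size=(output_size, hidden_size))
    self.h = np.zeros(hidden_size)

  def step(self, x):
    # update the hidden state
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

rnn = RNN(input_size=4, hidden_size=8, output_size=4)
x = np.array([1.0, 0.0, 0.0, 0.0]) # x is an input vector
y = rnn.step(x) # y is the RNN's output vector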

### Step Equation
$$ h_t = \tanh ( W_{hh} h_{t-1} + W_{xh} x_t ) $$

RNNs work great with stacking.
One RNN receives the input vectors, and a second RNN receives the output of the first RNN as its input.

- 512 RNN units = 1 RNN neuron that outputs a 512-wide vector, i.e. a vector with 512 values.
- One RNN unit -> an RNN with one hidden layer. Thus people say "stacking RNNs on top of each other".

### Vanilla RNN Math

If an input or output neuron has a value at timestep $t$, we denote this vector as:
input -> $x_t$, output -> $y_t$

Since we can have multiple hidden layers, we denote the hidden state vector at timestep $t$ and hidden layer $l$ as:
hidden -> $h_t^l$

Example: Many-to-many RNN with sequential input, sequential output, multiple timesteps, and multiple hidden layers.
$$
h_t^l =
\begin{cases}
f_w(h_{t-1}^l, x_t)  & \text{for l = 1} \\
f_w(h_{t-1}^l, h_t^{l-1})  & \text{for l > 1}
\end{cases}
$$

First, let’s list out the possible functional dependencies for a given hidden state, based on the arrows and flow of information in the diagram:
- An input
- Hidden state at the previous timestep, same layer
- Hidden state at the current timestep, previous layer

A hidden state can have at most two functional dependencies. Just by looking at the diagram, the only impossible combination is to be dependent on both the input and a hidden state at the current timestep but previous layer. This is because the only hidden states that are dependent on input exist in the first hidden layer, where no such previous layer exists.

Because of the impossible combination, we have to define two separate equations: one for the hidden state at hidden layer **1** and one for the layers after 1.

The function **$ f_w $** computes the numeric hidden state vector for timestep **t** and layer **l**. This contains the activation function like in ANNs. **W** are the weights of the RNN and thus **f** is conditioned on **W**.


You might notice that we have a couple issues:
- When $t = 1$ (that is, when each neuron is at the initial timestep), no previous timestep exists. However, we still attempt to pass $h_0$ as a parameter to $f_w$.
- If no input exists at time $t$ (thus, $x_t$ does not exist), we still attempt to pass $x_t$ as a parameter.

Our respective solutions follow:
- Define $h_0$ for any layer as 0
- Consider $x_t$ where no input exists at timestep $t$ as 0
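The two fixes above can be sketched as follows for a single-layer RNN that only receives an input at the first timestep. The function name `unroll`, the sizes, and the random weights are assumptions for illustration, and biases are left out to keep it short.

```python
import numpy as np

def unroll(first_input, num_steps, W_xh, W_hh):
    """Run num_steps timesteps when only x_1 exists."""
    h = np.zeros(W_hh.shape[0])            # h_0 is defined as 0
    no_input = np.zeros_like(first_input)  # stands in for a missing x_t
    states = []
    for t in range(1, num_steps + 1):
        x_t = first_input if t == 1 else no_input
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return states

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(3, 4))  # hypothetical: 4 inputs, 3 hidden units
W_hh = rng.normal(scale=0.5, size=(3, 3))
first_input = np.array([1.0, 0.0, 0.0, 0.0])
states = unroll(first_input, 5, W_xh, W_hh)
print(len(states), states[0].shape)
```

Because $h_0 = 0$, the first state depends only on the input; because the later inputs are 0, every later state depends only on the previous state.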


5 different types of weight matrices:
- input to hidden -> $W_{xh}$: maps an input vector **x** to a hidden state vector **h**
- hidden to hidden in time -> $W_{hht}^l$: maps a hidden state vector **h** to another hidden state vector **h** along the time axis, e.g. $h_{t-1}^l$ to $h_t^l$
- hidden to hidden in depth -> $W_{hhd}^l$: maps a hidden state vector **h** to another hidden state vector **h** along the depth axis, e.g. $h_t^{l-1}$ to $h_t^l$
- hidden to output -> $W_{hy}$: maps a hidden state vector **h** to an output vector **y**
- biases -> $b_h^l, b_y$: like in an ANN, we add a constant bias vector that shifts what we pass to the activation function.



### Defining the function **fw**

$$ h_t^l = f_w(h_{t-1}^l, x_t) \text{ for } l = 1 $$

$$ = \tanh(W_{hht}^l h_{t-1}^l + W_{xh} x_t + b_h^l) $$


$$ h_t^l = f_w(h_{t-1}^l, h_t^{l-1}) \text{ for } l > 1 $$

$$ = \tanh(W_{hht}^l h_{t-1}^l + W_{hhd}^l h_t^{l-1} + b_h^l) $$
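Here is one way the two cases of $f_w$ could look in NumPy. It is a sketch under my own assumptions: layer 1 combines $W_{xh} x_t$ with its previous state, while layers above 1 use the depth matrix $W_{hhd}^l$ on the state of the layer below, following the weight-matrix list. The function name `deep_rnn_forward`, the list-of-matrices layout, and the sizes are made up for illustration.

```python
import numpy as np

def deep_rnn_forward(xs, W_xh, W_hht, W_hhd, b_h):
    """Compute h_t^l for every timestep t and layer l.

    xs:    sequence of input vectors x_1 .. x_T
    W_hht: per-layer time matrices (index 0 is layer 1)
    W_hhd: per-layer depth matrices (entry 0 is unused, layer 1 reads x_t)
    b_h:   per-layer bias vectors
    """
    num_layers = len(W_hht)
    hidden_size = W_hht[0].shape[0]
    h_prev = [np.zeros(hidden_size) for _ in range(num_layers)]  # h_0^l = 0
    history = []
    for x_t in xs:
        h_now = []
        for l in range(num_layers):
            if l == 0:
                from_below = W_xh @ x_t               # first layer reads the input
            else:
                from_below = W_hhd[l] @ h_now[l - 1]  # deeper layers read the layer below
            h_now.append(np.tanh(W_hht[l] @ h_prev[l] + from_below + b_h[l]))
        history.append(h_now)
        h_prev = h_now
    return history

rng = np.random.default_rng(0)
input_size, hidden_size, num_layers, num_steps = 4, 6, 3, 5  # hypothetical sizes
W_xh = rng.normal(scale=0.3, size=(hidden_size, input_size))
W_hht = [rng.normal(scale=0.3, size=(hidden_size, hidden_size)) for _ in range(num_layers)]
W_hhd = [None] + [rng.normal(scale=0.3, size=(hidden_size, hidden_size))
                  for _ in range(num_layers - 1)]
b_h = [np.zeros(hidden_size) for _ in range(num_layers)]
xs = rng.normal(size=(num_steps, input_size))

history = deep_rnn_forward(xs, W_xh, W_hht, W_hhd, b_h)
print(len(history), len(history[0]), history[0][0].shape)
```

`history[t][l]` holds the hidden state for timestep `t + 1` and layer `l + 1` (Python counts from 0, the math counts from 1).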

Does this look similar to the ANN hidden function? It applies the weights to the corresponding parameters, adds the bias, and passes the weighted sum through an activation function to introduce non-linearities. This contrasts with ANNs because RNNs operate over vectors versus scalars.

We tend to use tanh with RNNs mostly because of its role in LSTMs: it produces gradients with a greater range, and its second derivative doesn't die off as quickly. Tanh also has a greater range than the sigmoid: its lower bound is y = -1 instead of y = 0, and it intercepts the y-axis at y = 0 instead of y = 0.5.

### The final equation!
Mapping the hidden state to an output:
$$ y_t = W_{hy} h_t^l + b_y $$

Depending on the context, we might need to remove the bias vector and apply a non-linearity such as sigmoid, or softmax if we need the output to be a probability distribution.
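As a small sketch of the output mapping: compute $W_{hy} h + b_y$ and, when a probability distribution is wanted, normalize the scores. Note the swap: I use softmax here rather than an element-wise sigmoid, because softmax is what makes the values sum to 1. The helper names `softmax` and `output_layer`, the sizes, and the random values are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def output_layer(h_top, W_hy, b_y):
    """Map a hidden state to a probability distribution over outputs."""
    scores = W_hy @ h_top + b_y  # y_t before any non-linearity
    return softmax(scores)

rng = np.random.default_rng(0)
hidden_size, output_size = 6, 4  # hypothetical sizes
h_top = np.tanh(rng.normal(size=hidden_size))  # stand-in for a real hidden state
W_hy = rng.normal(scale=0.5, size=(output_size, hidden_size))
b_y = np.zeros(output_size)

probs = output_layer(h_top, W_hy, b_y)
print(probs, probs.sum())
```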

### Example
A one-to-many, single-layer RNN needs to output "hello".

The NN has the vocabulary h, e, l, o. It only knows these four characters; exactly enough to produce the word "hello". We will input the first character "h" and from there expect the output at the following timesteps to be: "e", "l", "l", "o".

Let's represent the input and output via one-hot encoding, where each character is a vector with a 1 at the corresponding character position. Since our vocabulary is [h, e, l, o], we can represent characters using a vector with four values.

$$
\text{h} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\text{e} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\text{l} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\text{o} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
$$
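The same encoding in code, as a minimal sketch. The helper names `one_hot` and `decode` are my own.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a vector with a 1 at the character's position in the vocabulary."""
    vec = np.zeros(len(vocab))
    vec[char_to_index[ch]] = 1.0
    return vec

def decode(vec):
    """Map a one-hot (or score) vector back to its most likely character."""
    return vocab[int(np.argmax(vec))]

encoded = [one_hot(ch) for ch in "hello"]
print(np.stack(encoded, axis=1))  # each column is one character of "hello"
print("".join(decode(v) for v in encoded))
```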


Either we input only the first letter and the network completes the word, or we have 4 inputs and 4 outputs: we sample the output at each timestep and feed it into the next timestep as input. RNNs need to have a start and end token. They signify when the input begins and the output ends.
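The sample-and-feed-back loop can be sketched like this. The weights are random and untrained, so the generated characters are arbitrary; the point is only the mechanics. The function name `generate`, the hidden size, and the use of softmax to turn scores into sampling probabilities are assumptions, and start/end tokens are left out.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def one_hot(index):
    vec = np.zeros(len(vocab))
    vec[index] = 1.0
    return vec

def generate(first_char, num_chars, W_xh, W_hh, W_hy, rng):
    """Feed one character in, then keep feeding each sampled output back in."""
    h = np.zeros(W_hh.shape[0])
    x = one_hot(vocab.index(first_char))
    out = []
    for _ in range(num_chars):
        h = np.tanh(W_hh @ h + W_xh @ x)
        probs = softmax(W_hy @ h)
        index = rng.choice(len(vocab), p=probs)  # sample the next character
        out.append(vocab[index])
        x = one_hot(index)                       # the output becomes the next input
    return "".join(out)

rng = np.random.default_rng(0)
hidden_size = 8  # hypothetical
W_xh = rng.normal(scale=0.5, size=(hidden_size, len(vocab)))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(len(vocab), hidden_size))

text = generate("h", 4, W_xh, W_hh, W_hy, rng)
print("h" + text)  # a trained network would ideally print "hello"
```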

### Backpropagation with RNNs
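BPTT: backpropagation through time. The network is unrolled over its timesteps and the gradient flows backwards from the last timestep to the first.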
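A compact sketch of BPTT for a single-layer vanilla RNN without biases. The squared-error loss, the function names `forward` and `bptt`, and all sizes are assumptions chosen to keep the example short. One entry of the gradient is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

def forward(xs, targets, W_xh, W_hh, W_hy):
    """Run the RNN over the sequence; return the loss and every hidden state."""
    hs = [np.zeros(W_hh.shape[0])]  # hs[0] is h_0 = 0
    loss = 0.0
    for x, target in zip(xs, targets):
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x))
        y = W_hy @ hs[-1]
        loss += 0.5 * np.sum((y - target) ** 2)
    return loss, hs

def bptt(xs, targets, W_xh, W_hh, W_hy):
    """Gradients of the summed squared error with respect to each weight matrix."""
    loss, hs = forward(xs, targets, W_xh, W_hh, W_hy)
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros(W_hh.shape[0])  # gradient arriving from timestep t + 1
    for t in reversed(range(len(xs))):
        h, h_prev = hs[t + 1], hs[t]
        dy = W_hy @ h - targets[t]      # dLoss/dy_t
        dW_hy += np.outer(dy, h)
        dh = W_hy.T @ dy + dh_next      # from this output and from the future
        dz = dh * (1.0 - h ** 2)        # back through tanh
        dW_xh += np.outer(dz, xs[t])
        dW_hh += np.outer(dz, h_prev)
        dh_next = W_hh.T @ dz           # hand the gradient to timestep t - 1
    return loss, dW_xh, dW_hh, dW_hy

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 4))       # 6 timesteps, 4 inputs (hypothetical)
targets = rng.normal(size=(6, 3))  # 3 outputs per timestep
W_xh = rng.normal(scale=0.5, size=(5, 4))
W_hh = rng.normal(scale=0.5, size=(5, 5))
W_hy = rng.normal(scale=0.5, size=(3, 5))

loss, dW_xh, dW_hh, dW_hy = bptt(xs, targets, W_xh, W_hh, W_hy)

# Finite-difference check on a single entry of W_hh.
eps = 1e-6
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (forward(xs, targets, W_xh, W_plus, W_hy)[0]
           - forward(xs, targets, W_xh, W_minus, W_hy)[0]) / (2 * eps)
print(abs(numeric - dW_hh[0, 1]))  # should be very close to 0
```

The line `dh_next = W_hh.T @ dz` is where the trouble starts: the gradient is multiplied by the same matrix (and a tanh derivative of at most 1) at every step back in time.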
Article on vanishing gradient problem: https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b

Because of the vanishing gradient problem, RNNs kind of suck. So onto LSTMs.