# Recurrent Neural Networks

So far we have been looking at convolutional neural networks (CNNs), models that allow us to analyze the spatial information in a given input image. CNNs excel in tasks that rely on finding spatial and visible patterns in training data. 

RNNs, on the other hand, give us a way to incorporate **memory** into our neural networks, and will be critical when analyzing sequential data. RNNs are most associated with text processing and text generation because of the way sentences are structured as a sequence of words. 

So far the neural network architectures we've seen have been trained using the current inputs only. We did not consider previous inputs when generating the current output. In other words, our systems did not have any memory elements. RNNs address this very basic and important issue by using **memory** (past inputs to the network) when producing the current output. 

RNNs are artificial neural networks that can capture temporal dependencies, which are dependencies over time. They are called recurrent neural networks because recurrent means occurring often or repeatedly, and we perform the same task for each element in the input sequence. 

**Feedforward neural networks**: we will review nonlinear function approximation, training using backpropagation or stochastic gradient descent, and evaluation. 

**Simple RNN**, also known as the Elman network: we will learn how to train the network, and also understand the limitations of simple RNNs and how they can be overcome by using LSTMs. 

### RNNs History
**Elman and Jordan Networks**

It was recognized in the early 90s that all of these networks suffer from what we call the **vanishing gradient problem**, in which contributions of information decay geometrically over time. As a result, capturing relationships that spanned more than 8 or 10 steps back was practically impossible. 

What does this mean? When training our network we use **backpropagation**, the process in which we adjust our weight matrices with the use of a **gradient**. In the process, gradients are calculated by repeated multiplications of derivatives. The values of these derivatives may be so small that these repeated multiplications cause the gradient to practically "vanish". **LSTM** is one option to overcome this issue. [More Resources on Vanishing gradients](https://en.wikipedia.org/wiki/Vanishing_gradient_problem)
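To get a feel for how quickly repeated multiplication shrinks a gradient, here is a minimal sketch. It makes simplifying assumptions that are not part of the notes above: a sigmoid activation, every derivative evaluated at the same point `z = 0` (where the sigmoid derivative is at its maximum of 0.25), and weights ignored entirely. The function name `gradient_factor` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_factor(steps, z=0.0):
    """Product of `steps` sigmoid derivatives, all evaluated at z.

    Assumption: weights are ignored, so this only shows the effect
    of multiplying the activation derivatives together.
    """
    factor = 1.0
    for _ in range(steps):
        factor *= sigmoid_derivative(z)
    return factor


# The factor shrinks geometrically with the number of steps.
for steps in (1, 5, 10, 20):
    print(steps, gradient_factor(steps))
```

Even in this best case for the sigmoid, the factor after 10 steps is below one millionth, which matches the observation that dependencies more than 8 or 10 steps back are hard to capture.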

In the mid 90s, Long Short-Term Memory (LSTM) cells were invented to address this very problem. The key novelty in LSTMs was the idea that some signals, what we call state variables, can be kept fixed by using gates, and reintroduced or not at an appropriate time in the future. In this way, arbitrary time intervals can be represented, and temporal dependencies can be captured. Variations on LSTMs such as Gated Recurrent Units, or GRUs for short, further refined this theme and nowadays represent another mainstream approach to realizing RNNs. This gives a solution to the vanishing gradient problem by helping us apply networks that have temporal dependencies. In this lesson we will focus on RNNs and continue with LSTMs. 

### Applications

- Predicting stock price movements based on historical patterns of stock movements and potentially other market conditions that change over time.
- Other NLP applications, such as chatbots.
- Gesture recognition.

### Feedforward Neural Network Reminder

We can have many hidden layers between the inputs and the outputs; for simplicity, let's use a single hidden layer in a FFNN.

With a good understanding of the FFNN, the step towards RNNs becomes simple. 

When implementing your neural net, you will find that we can combine these techniques, for example a convolutional network with a recurrent network. 

![cnns_with_rnns](cnns_with_rnns.png)

One can use CNNs in the first few layers for the purpose of feature extraction, and then use RNNs in the final layer where memory needs to be considered. A popular application for this is in gesture recognition. 

When working on a feedforward neural network, we actually simulate an artificial neural network by using a nonlinear function approximation. 

$$ \hat{y} = F(\bar{x}, W)$$

That function acts as a system with $n$ inputs, a set of weights, and $k$ outputs. We use $\bar{x}$ as the input vector and $\bar{y}$ as the output vector. Inputs and outputs can also be many-to-many, many-to-one, and one-to-many. 

So, when a neural network essentially works as a nonlinear function approximator, what we do is try to fit a smooth function between given points like $(x_1, y_1)$, $(x_2, y_2)$, and so on, in such a way that when we have a new input ${x}^\prime$ we can find the new output $y^\prime$.

There are two main types of applications. One is classification, where we identify which of a set of categories a new input belongs to. For example, an image classification where the neural network receives as an input an image and can know if it's a cat. The other application is regression, where we approximate a function, so the network produces continuous values following a supervised training process. A simple example can be time series forecasting, where we predict the price of a stock tomorrow based on the price of the stock over the past five days. The input to the network would be five values representing the price of the stock for each of the past five days, and the output we want is tomorrow's price. 
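The stock forecasting setup described above can be sketched as a small data-preparation step that turns a price series into (five past days, next day) training pairs. The helper name `make_windows` and the price values are made up for illustration.

```python
def make_windows(prices, window=5):
    """Split a price series into (inputs, targets) pairs.

    Each input is `window` consecutive prices; the target is the
    price on the following day.
    """
    inputs, targets = [], []
    for start in range(len(prices) - window):
        inputs.append(prices[start:start + window])
        targets.append(prices[start + window])
    return inputs, targets


# Made-up daily closing prices, for illustration only.
prices = [10.0, 10.5, 10.2, 10.8, 11.0, 11.3, 11.1, 11.6]
inputs, targets = make_windows(prices)

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Each row of `inputs` would be fed to the network as five input values, and the matching entry of `targets` is the value the regression output should approximate.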

Our task in neural networks is to find the best set of weights that yield a good output where x represents the inputs, and W represents the weights. We start with random weights. In feedforward neural networks, we have static mapping from the inputs to the outputs. We use the word static as we have no memory and the output depends only on the inputs and the weights. In other words, for the same input and the same weights, we always receive the same output. 

![neural_task](neural_network_task.png)

There are two primary phases:
- Training
- Evaluation

In the training phase, we take the dataset called the training set which includes many pairs of inputs and their corresponding targets or outputs. And the goal is to find a set of weights that would best map the inputs to the desired outputs. 

More precisely, the goal of the training phase is to yield a network that generalizes beyond the training set. 

In the evaluation phase we use the network that was created in the training phase, apply our new inputs, and expect to obtain the desired outputs. 

Let's now look at a basic model of an artificial neural network, where we have only a single hidden layer. The inputs are each connected to the neurons in the hidden layer, and the neurons in the hidden layer are each connected to the neurons in the output layer, where each neuron represents a single output. We can look at it as a collection of mathematical functions. Each input is connected mathematically to a hidden layer of neurons through a set of weights we need to modify, and each hidden layer neuron is connected to the output layer in a similar way. There is no limit to the number of inputs, number of hidden neurons in a layer, and number of outputs, nor are there any correlations between those numbers, so we can have $n$ inputs, $m$ hidden neurons, and $k$ outputs. 

$$h_i = F(x_i, W_{ij}^1)$$

$$y_i = F(h_i, W_{ij}^2)$$

![network_output_map](network_function_map.png)

We can see that each input is multiplied by its corresponding weight and added at the next layer's neuron with a bias as well. The bias is an external parameter of the neuron and can be modeled by adding an external fixed value input. This entire summation will usually go through an activation function to the next layer or to the output. 
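As a minimal sketch of the paragraph above, the two functions below compute a single neuron's output in two equivalent ways: with an explicit bias term, and with the bias modeled as the weight of an extra input fixed at 1. The function names, the example numbers, and the choice of `tanh` as the activation function are assumptions made for illustration.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through tanh."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return math.tanh(z)


def neuron_bias_as_input(inputs, weights_with_bias):
    """Same neuron, with the bias treated as the weight of a fixed input of 1."""
    extended_inputs = list(inputs) + [1.0]
    z = sum(x * w for x, w in zip(extended_inputs, weights_with_bias))
    return math.tanh(z)


inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.1

out_a = neuron(inputs, weights, bias)
out_b = neuron_bias_as_input(inputs, weights + [bias])
print(out_a, out_b)
```

Both calls produce the same value, which is why the bias can be folded into the weight matrix when convenient.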

**What's our goal?**
We can look at our system as a black box that has $n$ inputs and $k$ outputs. Our goal is to design the system in such a way that it will give us the correct output $y$ for a specific input $x$. Our job is to decide what's inside this black box. We know that we will use an artificial neural network and need to train it to eventually have a system that will yield the correct output for a specific input. Essentially, what we really want is to find the optimal set of weights connecting the input to the hidden layer, and the optimal set of weights connecting the hidden layer to the output. We will never have a perfect estimation, but we can try to be as close to it as we can. To do that, we need to start a process you're already familiar with, and that is the training phase. So let's look at the training phase, where we will find the best set of weights for our system.

This phase will include two steps, feedforward and backpropagation, which we will repeat as many times as we need until we decide that our system is as good as it can be. In the feedforward part, we will calculate the output of the system. 

The output will be compared to the correct output giving us an indication of the error. There are a few ways to determine the error. 

In the backpropagation part, we will change the weights as we try to minimize the error, and start the feedforward process all over again. 
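The feedforward, error, and backpropagation cycle described above can be sketched with NumPy as follows. This is one possible illustration under stated assumptions, not a reference implementation: the toy data, the `tanh` hidden activation, the linear output, the mean squared error, the learning rate, and the number of epochs are all choices made here, and biases are left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): learn y = x1 + x2 from four samples.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = X.sum(axis=1, keepdims=True)

n, m, k = 2, 3, 1                        # inputs, hidden neurons, outputs
W1 = rng.normal(scale=0.5, size=(n, m))  # random starting weights
W2 = rng.normal(scale=0.5, size=(m, k))
learning_rate = 0.05                     # assumed value

losses = []
for epoch in range(1000):
    # Feedforward: compute the hidden layer, then the output.
    H = np.tanh(X @ W1)
    Y_hat = H @ W2

    # Error: compare the output with the targets (mean squared error).
    error = Y_hat - Y
    losses.append(float(np.mean(error ** 2)))

    # Backpropagation: gradients of the error with respect to the weights.
    grad_out = 2.0 * error / error.size
    grad_W2 = H.T @ grad_out
    grad_H = grad_out @ W2.T
    grad_W1 = X.T @ (grad_H * (1.0 - H ** 2))  # tanh'(z) = 1 - tanh(z)^2

    # Change the weights in the direction that reduces the error.
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2

print("first loss:", losses[0], "last loss:", losses[-1])
```

Watching `losses` over the epochs is one simple way to decide when to stop repeating the cycle.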

### Calculating the value of the hidden states

We will look at the feedforward network. To make our computations easier, let's decide to have $n$ inputs, three neurons in a single hidden layer, and two outputs. In practice, we can have thousands of neurons in a single hidden layer. We will use $W_1$ as the set of weights from $x$ to $h$, and $W_2$ as the set of weights from $h$ to $y$.

Since we have only one hidden layer, we will have only two steps in each feedforward cycle. 

![feedforward_cycle](feedforward_cycle.png)

- Step 1: We'll be finding $h$ from a given input and a set of weights $W_1$
- Step 2: We'll be finding the output $y$ from the calculated $h$ and the set of weights $W_2$.

You will find that other than the use of non-linear activation functions, all of the calculations involve linear combinations of inputs and weights. Or in other words, we will use matrix multiplications. 
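The two steps above can be sketched with NumPy matrix multiplications. The shapes follow the setup in this section ($n$ inputs, three hidden neurons, two outputs); the value `n = 4`, the random weights, the `tanh` hidden activation, and the linear output layer are assumptions made only for this example, and the function name `feedforward` is hypothetical.

```python
import numpy as np


def feedforward(x, W1, W2, activation=np.tanh):
    """One feedforward cycle through a single hidden layer."""
    h = activation(x @ W1)  # Step 1: hidden values from x and W1
    y = h @ W2              # Step 2: outputs from h and W2 (linear output assumed)
    return h, y


n = 4                           # arbitrary number of inputs for the example
rng = np.random.default_rng(0)
x = rng.normal(size=n)          # input vector with n entries
W1 = rng.normal(size=(n, 3))    # n rows, three columns (three hidden neurons)
W2 = rng.normal(size=(3, 2))    # three rows, two columns (two outputs)

h, y = feedforward(x, W1, W2)
print(h.shape, y.shape)
```

Calling `feedforward` again with the same `x`, `W1`, and `W2` returns the same values, which is the static mapping of a feedforward network: there is no memory between calls.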

Starting with **Step 1**: finding $h$.

Notice that if we have more than one neuron in the hidden layer, which is usually the case, $h$ is actually a vector. We will have our initial inputs $x$, which is also a vector, and we want to find the values of the hidden neurons, $h$. Each input is connected to each neuron in the hidden layer. For simplicity, we will use the following indices: $W_{11}$ connects $x_1$ to $h_1$.

![w1](w1_math.png)

The vector of inputs $x_1, \cdots , x_n$ is multiplied by the weight matrix $W_1$ to give us the hidden neurons $h$. So the vector $h$ equals the vector $x$ multiplied by the weight matrix $W_1$. 

$$\bar{h}^\prime = \bar{x} \cdot W_1$$

In this case we have a weight matrix with $n$ rows, as we have $n$ inputs, and three columns, as we have three neurons in the hidden layer. If you multiply the input vector by the weight matrix, you will have a simple linear combination for each neuron in the hidden layer giving us vector, 