  ## Recurrent Neural Networks 
  
  [Wonderful Paper on the status of RNNs in 2017](https://arxiv.org/abs/1801.01078?utm_campaign=Revue%20newsletter&utm_medium=Newsletter&utm_source=The%20Wild%20Week%20in%20AI)
  
The recurrent neural network is used for sequential data such as time series and text. It was derived from the need to model memory in networks, since a feed-forward net treats each sample independently.
  
An RNN is a network of artificial neurons with recurrent connections among them. The recurrent connections learn the dependencies within sequential or time-series input data. This ability to learn sequential dependencies has made RNNs popular in applications such as speech recognition, speech synthesis, machine vision, and video description generation.
  
A simple RNN just updates a state vector iteratively as it steps through the input sequence, like so:
  
  ```python 
state = 0
for input_ in input_sequence: 
    state = activation(dot(W, input_) + dot(U, state) + b)
```
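
To make the loop above concrete, here is a minimal NumPy sketch of the same update. The choice of `tanh` as the activation, the function name `simple_rnn_forward`, and the array sizes are assumptions for illustration, not part of the original pseudocode.

```python
import numpy as np

def simple_rnn_forward(inputs, W, U, b):
    """Run a simple RNN over `inputs` of shape (timesteps, input_dim).

    Returns the state at every timestep, shape (timesteps, units).
    """
    state = np.zeros(U.shape[0])     # initial state: all zeros
    states = []
    for x_t in inputs:               # one update per timestep
        state = np.tanh(W @ x_t + U @ state + b)
        states.append(state)
    return np.stack(states)

rng = np.random.default_rng(0)
timesteps, input_dim, units = 10, 4, 8   # illustrative sizes (assumed)
inputs = rng.normal(size=(timesteps, input_dim))
W = rng.normal(scale=0.1, size=(units, input_dim))  # input weights
U = rng.normal(scale=0.1, size=(units, units))      # recurrent weights
b = np.zeros(units)

states = simple_rnn_forward(inputs, W, U, b)
print(states.shape)  # (10, 8)
```

The state at each timestep depends on the current input and on the previous state, which is the only "memory" this model has.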

But this naive model quickly loses memory because of the vanishing gradient problem: gradient magnitudes shrink exponentially as they are propagated back through time. This phenomenon causes the network to ignore long-term dependencies, so it hardly learns the correlation between temporally distant events. There are two reasons for this:

1. Most nonlinear activation functions (such as the sigmoid) have a gradient that is close to zero almost everywhere.

2. The gradient is multiplied over and over by the recurrent matrix as it is back-propagated.

Likewise, gradients in training RNNs on long sequences may explode: as the weights become larger, the norm of the gradient grows sharply during training. Learning long-term dependencies in data is therefore one of the main challenges in training RNNs. The difficulty generally stems from the large number of parameters that must be optimized over long time spans.
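
The vanishing case can be seen numerically with a scalar RNN, where the recurrent matrix is a single weight `u`. This is a toy sketch: the function name `gradient_through_time`, the weights, and the constant input are assumptions chosen for illustration. Each backward step multiplies the gradient by the activation derivative (the first reason) and by the recurrent weight (the second reason).

```python
import math

def gradient_through_time(u, w, xs):
    """Return |d h_t / d h_0| for each t in the scalar RNN
    h_t = tanh(w * x_t + u * h_{t-1})."""
    h = 0.0
    grad = 1.0
    magnitudes = []
    for x in xs:
        h = math.tanh(w * x + u * h)
        # Chain rule: d h_t / d h_{t-1} = (1 - h_t^2) * u
        grad *= (1.0 - h * h) * u
        magnitudes.append(abs(grad))
    return magnitudes

xs = [1.0] * 50                       # constant toy input sequence
mags = gradient_through_time(u=0.9, w=1.0, xs=xs)
print(mags[0], mags[9], mags[49])     # each value is much smaller than the last
```

Because every factor `(1 - h_t^2) * u` here has magnitude below 1, the product shrinks at every step, so an event 50 timesteps back contributes essentially nothing to the gradient.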

--------------------------------------------------------

### LSTM 

One solution is to extend the memory explicitly. Imagine a conveyor belt running parallel to the sequence you are processing. Information can jump onto the conveyor belt at any timestep and jump back off into the network at a later timestep, intact. This is the core idea of an LSTM.

Under the hood, this approach replaces the plain “sigmoid” or “tanh” hidden units with memory cells whose inputs and outputs are controlled by gates. These gates control the flow of information to the hidden neurons and preserve features extracted at previous timesteps.
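
The gating described above can be sketched as a single LSTM timestep in NumPy. This follows the commonly used LSTM formulation; the function name `lstm_step`, the gate ordering inside the stacked weight matrices, and the hand-set biases are assumptions for illustration and do not mirror any particular library's internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep.

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    Rows are stacked as [input gate, forget gate, candidate, output gate].
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values to write
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much to expose
    c_t = f * c_prev + i * g              # cell state (the "conveyor belt")
    h_t = o * np.tanh(c_t)                # hidden state for the next step
    return h_t, c_t

# Toy check: close the input gate and open the forget gate via the biases,
# then carry a cell state across 100 timesteps.
units, input_dim = 3, 2
W = np.zeros((4 * units, input_dim))
U = np.zeros((4 * units, units))
b = np.zeros(4 * units)
b[0:units] = -20.0           # input gate ~ 0: nothing new is written
b[units:2 * units] = 20.0    # forget gate ~ 1: everything is kept
c_start = np.array([0.5, -1.0, 2.0])
h, c = np.zeros(units), c_start.copy()
for x_t in np.ones((100, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(np.round(c, 3))        # approximately the starting cell state
```

With the forget gate open and the input gate closed, the cell state survives 100 steps almost unchanged, which is the "intact" transport the conveyor-belt picture describes.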

Below is a Keras LSTM. For multi-layer LSTMs, you need to set the `return_sequences` parameter to `True` for all intermediate layers:

```python 
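from keras import layers
from keras.models import Sequential

# input_shape is (timesteps, features) and must be defined for your data
model = Sequential()
model.add(layers.LSTM(32,
                      return_sequences=True,   # first or middle layers
                      input_shape=input_shape))
model.add(layers.LSTM(32,
                      return_sequences=False,  # last layer
                      ))
model.add(layers.Dense(1))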
```

The LSTM has a higher computational cost than a simple RNN because its hidden layer is more complex. For hidden layers of identical size, a typical LSTM has about four times as many parameters as a simple RNN.
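
The factor of four can be checked with simple arithmetic. The helper names below are hypothetical, and the count assumes the standard formulation with one input weight matrix, one recurrent weight matrix, and one bias vector per gate; specific implementations may add a few extra parameters.

```python
def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + bias
    return units * input_dim + units * units + units

def lstm_params(input_dim, units):
    # one full weight set each for the input, forget, and output gates,
    # plus one for the candidate cell values
    return 4 * simple_rnn_params(input_dim, units)

print(simple_rnn_params(32, 32))  # 2080
print(lstm_params(32, 32))        # 8320
```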



#### GRU - Gated Recurrent Unit
A computationally less expensive variant of the LSTM is the GRU, or gated recurrent unit. GRUs work on the same principle as an LSTM but are more streamlined, at the cost of somewhat less representational power. They are more robust to vanishing gradients than a simple RNN and have lower memory requirements than an LSTM.
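
To show what "more streamlined" means, here is a sketch of one GRU timestep in NumPy. It uses two gates and no separate cell state, so it needs three weight sets where the LSTM needs four. The function name `gru_step`, the stacking order of the weights, and the sign convention of the update gate are assumptions; published variants and libraries differ in these details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU timestep.

    W: (3*units, input_dim), U: (3*units, units), b: (3*units,)
    Rows are stacked as [update gate, reset gate, candidate].
    """
    Wz, Wr, Wh = np.split(W, 3)
    Uz, Ur, Uh = np.split(U, 3)
    bz, br, bh = np.split(b, 3)
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)              # reset gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate state
    return (1.0 - z) * h_prev + z * h_cand                # blend old and new

rng = np.random.default_rng(0)
units, input_dim = 4, 3                  # illustrative sizes (assumed)
W = rng.normal(scale=0.1, size=(3 * units, input_dim))
U = rng.normal(scale=0.1, size=(3 * units, units))
b = np.zeros(3 * units)

h = np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W, U, b)
print(h.shape)  # (4,)
```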

A basic LSTM implementation, trained on the Keras IMDB dataset, can be found below.

```python
from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000
maxlen = 500
batch_size = 32

(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
```

```python
from keras.layers import Dense, Embedding, SimpleRNN, LSTM
from keras.models import Sequential

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
```

With `validation_split=0.2`, Keras trains on 20,000 samples and validates on 5,000.


------------------------------------------------------------

### LSTM Regularization 

#### Dropout 

Dropout is a common method to prevent overfitting, but applying it to RNNs is not trivial. In 2015, Yarin Gal showed that the proper way is to apply the same dropout mask at every timestep, rather than a different random mask per timestep. Keras implements this for you. There are two options: `dropout`, which sets the dropout rate for the input units, and `recurrent_dropout`, which sets the dropout rate for the recurrent units.

```python 
model.add(layers.GRU(32,
                     dropout=0.2,
                     recurrent_dropout=0.2))
```
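
The difference between a per-timestep mask and a shared mask can be illustrated in NumPy. This is only a sketch of the idea, not Keras's internal implementation; the variable names, sizes, and the inverted-dropout scaling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, features, rate = 5, 6, 0.2     # illustrative sizes (assumed)
x = np.ones((timesteps, features))

# Per-timestep dropout: a fresh random mask for every timestep.
# This is the variant to avoid in recurrent layers.
per_step_mask = rng.random((timesteps, features)) >= rate

# Shared-mask dropout: sample one mask and reuse it at every timestep.
shared_mask = rng.random(features) >= rate
x_dropped = x * shared_mask / (1.0 - rate)  # rescale the kept units

# Every timestep drops exactly the same feature positions.
print(np.all(x_dropped == x_dropped[0]))    # True
```

Because the same units are dropped at every timestep, the noise does not change from step to step and is less likely to disrupt what the network carries forward in its state.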

### Bidirectional LSTM 

A bidirectional LSTM is the Swiss Army knife of NLP. It takes advantage of the fact that you can process temporal data both forwards and backwards to extract more meaning than one direction alone. Sometimes, this technique can catch patterns not seen by unidirectional RNNs.

In implementation, the input sequence is copied and reversed, then the original and the reversed copy are fed into separate LSTM layers. Their outputs are then merged (added or concatenated) and returned. In Keras, you can make one like so:

```python 
model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1))

```
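
The copy, reverse, and merge steps can be sketched by hand in NumPy. To keep the example short it swaps in a simple `tanh` RNN for the LSTM, and merges by concatenation; the function names and sizes are hypothetical.

```python
import numpy as np

def rnn_last_state(inputs, W, U, b):
    """Run a simple tanh RNN and return only the final state."""
    state = np.zeros(U.shape[0])
    for x_t in inputs:
        state = np.tanh(W @ x_t + U @ state + b)
    return state

def bidirectional_last_state(inputs, fwd_params, bwd_params):
    """Process the sequence in both directions with separate weights."""
    forward = rnn_last_state(inputs, *fwd_params)
    backward = rnn_last_state(inputs[::-1], *bwd_params)  # reversed in time
    return np.concatenate([forward, backward])            # merge by concatenation

def make_params(rng, input_dim, units):
    """Random weights for one direction (illustrative initialization)."""
    return (rng.normal(scale=0.1, size=(units, input_dim)),
            rng.normal(scale=0.1, size=(units, units)),
            np.zeros(units))

rng = np.random.default_rng(0)
timesteps, input_dim, units = 12, 4, 8    # illustrative sizes (assumed)
inputs = rng.normal(size=(timesteps, input_dim))
fwd = make_params(rng, input_dim, units)
bwd = make_params(rng, input_dim, units)

out = bidirectional_last_state(inputs, fwd, bwd)
print(out.shape)  # (16,)
```

With concatenation the merged output is twice as wide as a single direction, so the layer that follows sees `2 * units` features.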

------------------------------------------------------------------------

## Convolutions in Sequence Processing 

As in image convnets, convolutions are good for processing sequences. This is because of their ability to extract local patterns from small patches. A 1D convnet for sequences is less computationally expensive than an RNN, and is a good alternative for simple tasks. You can use convolutions when **global ordering is not meaningful.** Otherwise RNNs win out.

You can use larger convolution window sizes than with images, up to 7 or 9. The architecture is similar to image nets: Conv1D --> MaxPooling1D --> Flatten or GlobalMaxPooling1D. Here is an example:

```python 
model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=maxlen))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())  # takes no pool size; reduces over all timesteps
model.add(layers.Dense(1))
```
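
To see how the sequence length changes through this stack, here is a NumPy sketch of the three operations. It assumes "valid" padding (no padding) and stride 1 for the convolution, and non-overlapping pooling windows; the function names and random weights are hypothetical and only the shapes are meant to be informative.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """'Valid' 1D convolution followed by ReLU.

    x: (timesteps, channels), kernels: (window, channels, filters)
    Returns (timesteps - window + 1, filters).
    """
    window = kernels.shape[0]
    steps = x.shape[0] - window + 1
    out = np.stack([
        # Each output step sees only a local patch of `window` timesteps.
        np.tensordot(x[t:t + window], kernels, axes=([0, 1], [0, 1])) + bias
        for t in range(steps)
    ])
    return np.maximum(out, 0.0)

def max_pool1d(x, pool):
    """Non-overlapping max pooling over time; drops any leftover timesteps."""
    steps = x.shape[0] // pool
    return x[:steps * pool].reshape(steps, pool, -1).max(axis=1)

def global_max_pool1d(x):
    """Max over all timesteps: one value per filter."""
    return x.max(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 128))               # (sequence length, embedding dim)
k1 = rng.normal(scale=0.01, size=(7, 128, 32))
k2 = rng.normal(scale=0.01, size=(7, 32, 32))

a = conv1d(x, k1, np.zeros(32))
p = max_pool1d(a, 5)
c = conv1d(p, k2, np.zeros(32))
g = global_max_pool1d(c)
print(a.shape, p.shape, c.shape, g.shape)     # (494, 32) (98, 32) (92, 32) (32,)
```

The global pooling step collapses the time axis entirely, which is why the final `Dense` layer can follow without a `Flatten`.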