# 1. Recurrent layers
[Recurrent Neural Network] (RNN) is a class of neural network architectures in which the nodes of a layer have internal connections, allowing the network to express temporal behaviour. There are many types of RNN layers, but they all share the same overall architecture. The image below shows the information flow for a single observation, or for a single document in the context of NLP.

<img src='image/rnn_general.png' style='height:175px; margin:20px auto;'>

Each green cell $\mathbf{x}_t\in\mathbb{R}^{V\times1}$ represents the embedding vector of a token, and each blue cell $\mathbf{h}_t\in\mathbb{R}^{D\times1}$ represents an output vector. For an input sequence of length $T$, the RNN unrolls into $T$ steps to match the input length. The most important part of an RNN layer is the grey cell $A$, which is repeated at every time step and is responsible for processing information. At each time step, the output $\mathbf{h}_t$ depends on the current input $\mathbf{x}_t$ and on the previous output $\mathbf{h}_{t-1}$, and therefore indirectly on all earlier steps $\mathbf{h}_{t-2},\mathbf{h}_{t-3},\dots$. This design resembles *memory* and enables an RNN to capture sequential relationships.

There are many architectures for recurrent layers; the only difference between them is how the cell $A$ is designed. In this article, we are going to learn the cell architectures of Simple RNN, LSTM and GRU.

[Recurrent Neural Network]: https://en.wikipedia.org/wiki/Recurrent_neural_network

## 1.1. Common blocks
This section introduces common blocks in recurrent architectures. Knowing each of them separately helps us understand complicated designs better.

### Concatenation

<img src='image/rnn_concatenation.png' style='height:100px; margin:0px auto;'>

Let's say we want to transform two input vectors $\mathbf{u}\in\mathbb{R}^{U\times1}$ and $\mathbf{v}\in\mathbb{R}^{V\times1}$ into $\mathbf{y}\in\mathbb{R}^{D\times1}$. Note that $U$ and $V$ are fixed dimensionalities of input, while $D$ is the desired output size. With weight matrices
$\mathbf{W}_{yu}\in\mathbb{R}^{D\times U},\mathbf{W}_{yv}\in\mathbb{R}^{D\times V}$
and bias vector $\mathbf{b}_y\in\mathbb{R}^{D\times1}$,
the actual formula behind the above image is:

$$\mathbf{y}=\mathbf{W}_{yu}\mathbf{u}+\mathbf{W}_{yv}\mathbf{v}+\mathbf{b}_y$$

Here, all three terms have size $(D\times1)$, the same as $\mathbf{y}$. We can also view the above formula as concatenating $\mathbf{u}$ and $\mathbf{v}$ into a single input vector $\mathbf{x}\in\mathbb{R}^{(U+V)\times1}$, then transforming it with a bigger weight matrix $\mathbf{W}_{yx}\in\mathbb{R}^{D\times(U+V)}$ obtained by placing $\mathbf{W}_{yu}$ and $\mathbf{W}_{yv}$ side by side. This explains why the formula is visualized as a concatenation.
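
The equivalence between the two views can be checked numerically. The following is a minimal NumPy sketch; the sizes `U, V, D = 3, 4, 2` and the random values are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
U, V, D = 3, 4, 2  # illustrative sizes (assumption)

u = rng.normal(size=(U, 1))
v = rng.normal(size=(V, 1))
W_yu = rng.normal(size=(D, U))
W_yv = rng.normal(size=(D, V))
b_y = rng.normal(size=(D, 1))

# View 1: two separate matrix products plus the bias
y_sum = W_yu @ u + W_yv @ v + b_y

# View 2: stack the inputs vertically and the weights horizontally
x = np.vstack([u, v])            # shape (U+V, 1)
W_yx = np.hstack([W_yu, W_yv])   # shape (D, U+V)
y_cat = W_yx @ x + b_y

print(y_sum.shape, np.allclose(y_sum, y_cat))  # (2, 1) True
```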

### Gate

<img src='image/rnn_gate.png' style='height:80px; margin:0px auto;'>

A gate consists of two calculation steps: (1) passing a vector through the sigmoid function and (2) using the result as an element-wise multiplier, much like a percentage. The sigmoid function (denoted $\sigma$) is responsible for producing numbers in the range $(0,1)$. The purpose of gates follows directly: they control how much information is let through.
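
As a small illustration of these two steps, here is a NumPy sketch. The names `gate`, `control` and `signal` are hypothetical and chosen only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(control, signal):
    """Scale each element of `signal` by a factor in (0, 1) computed from `control`."""
    return sigmoid(control) * signal

signal = np.array([2.0, -1.0, 4.0])
# Very negative control -> nearly closed, zero -> half open, very positive -> nearly open
control = np.array([-10.0, 0.0, 10.0])
gated = gate(control, signal)
print(gated)
```

The first element is almost fully blocked, the second is halved, and the third passes through nearly unchanged.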

## 1.1. Simple RNN

### Architecture
We call the vanilla architecture [Simple RNN] (1980s) to distinguish it from the family name. Its cell is very simple: a concatenated value is passed through an activation function. The activation function is usually $\tanh$, which produces values within the range $(-1,1)$, so that the network is able to express *sentiment*. The cell architecture is described in the image and formula below:

<img src='image/rnn_cell.png' style='height:160px; margin:0px auto;'>

$$\mathbf{h}_t=\phi(\mathbf{W}_{hx}\mathbf{x}_t+\mathbf{W}_{hh}\mathbf{h}_{t-1}+\mathbf{b}_h)$$

$\mathbf{W}_{hx},\mathbf{W}_{hh}$ and $\mathbf{b}_h$, as explained earlier, are the weight matrices and the bias vector, and $\phi$ is the activation function. Their corresponding sizes are $(D\times V)$, $(D\times D)$ and $(D\times 1)$. Note that these parameters are shared across all cells (time steps), hence the sum of their sizes gives the total number of parameters to be trained. For example, suppose we use a pretrained BERT model to encode a corpus containing $N=10\,000$ documents. The embedding dimension is $V=512$ and documents are truncated to $T=128$ tokens. If we set the number of units of the RNN layer to $D=50$, then the number of parameters of the layer is $D\times(D+V+1)=50\times563=28\,150$, regardless of $N$ and $T$.
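
To make the parameter sharing concrete, the sketch below applies the recurrence to one document in NumPy. The sizes `V, D, T = 8, 5, 10`, the random initialisation and the zero initial state are assumptions made for illustration; this is not how Keras implements the layer internally.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T = 8, 5, 10  # illustrative sizes (assumption), kept small for readability

# A single set of parameters, reused at every time step
W_hx = rng.normal(scale=0.1, size=(D, V))
W_hh = rng.normal(scale=0.1, size=(D, D))
b_h = np.zeros((D, 1))

xs = rng.normal(size=(T, V, 1))  # one document: T token embeddings of size (V, 1)
h = np.zeros((D, 1))             # initial state h_0, assumed to be zero
hs = []
for x_t in xs:
    # h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h)
    h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)
    hs.append(h)

n_params = W_hx.size + W_hh.size + b_h.size
print(len(hs), hs[-1].shape, n_params, D * (D + V + 1))  # 10 (5, 1) 70 70
```

The loop runs $T$ times, yet the parameter count depends only on $D$ and $V$.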

A well-known issue with Simple RNN is that it only has *short-term memory*. This property is easy to understand if you are familiar with the vanishing gradient problem of S-shaped activation functions. During [backpropagation through time], the derivative of a hidden state with respect to another one $\Delta t$ steps earlier is a product of $\Delta t$ partial derivatives, each of which contains the derivative of $\tanh$. Whenever $\tanh$ operates in its saturation zones, these factors are close to zero, so for a pair of words with large $\Delta t$ the product becomes almost zero. As a result, Simple RNN fails to capture long-term dependencies.
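
The shrinking product can be observed numerically. The sketch below tracks the spectral norm of $\partial\mathbf{h}_t/\partial\mathbf{h}_0$ as $t$ grows. It assumes an orthogonal $\mathbf{W}_{hh}$ (all singular values equal to 1, so any shrinkage comes only from the $\tanh$ derivative), random inputs and a zero initial state; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, T = 4, 3, 30  # illustrative sizes (assumption)

W_hx = rng.normal(size=(D, V))
# Orthogonal recurrent matrix, so W_hh alone neither shrinks nor amplifies gradients
W_hh, _ = np.linalg.qr(rng.normal(size=(D, D)))
b_h = np.zeros(D)

h = np.zeros(D)
jac = np.eye(D)  # accumulates d h_t / d h_0
norms = []
for t in range(T):
    x_t = rng.normal(size=V)
    h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)
    # One-step Jacobian: d h_t / d h_{t-1} = diag(1 - h_t^2) W_hh
    jac = np.diag(1.0 - h**2) @ W_hh @ jac
    norms.append(np.linalg.norm(jac, 2))

print(norms[0], norms[T // 2], norms[-1])
```

Each factor $\mathrm{diag}(1-\mathbf{h}_t^2)$ has entries in $(0,1]$, so under these assumptions the norm can only decrease as more steps are multiplied together.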

[Simple RNN]: https://en.wikipedia.org/wiki/Recurrent_neural_network#Fully_recurrent
[backpropagation through time]: https://en.wikipedia.org/wiki/Backpropagation_through_time

### Implementation
Simple RNN is implemented in TensorFlow at both the layer level and the cell level, via the classes
<code style='font-size:13px'><a href=https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN>SimpleRNN</a></code>
and
<code style='font-size:13px'><a href=https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNNCell>SimpleRNNCell</a></code>.
They have the following hyperparameters:
- <code style='font-size:13px; color:firebrick'>units</code>: the dimensionality of output space ($D$).
- <code style='font-size:13px; color:firebrick'>activation</code>: the activation function used in cells, defaults to *tanh*.

<b style='color:navy'><i class="fa fa-info-circle"></i>&nbsp; Note</b><br>
When using an RNN layer with other layers, there are two cases:
- If the next layer is Fully Connected, we only use the last hidden state, $\mathbf{h}_T$. The output shape in this case is $(N\times D)$.
- If the next layer is another recurrent layer (including LSTM and GRU), we need to return the full sequence $\mathbf{h}_1,\mathbf{h}_2,\dots,\mathbf{h}_T$. This is done by specifying <code style='font-size:13px'>return_sequences=True</code>. The output shape in this case is $(N\times T\times D)$.
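
The two output shapes can be reproduced with a small batched NumPy sketch. The function `simple_rnn` and the helper `init` are hypothetical stand-ins written for this note, not the Keras implementation; they use the row-vector layout, with weights of shape $(V\times D)$, $(D\times D)$ and $(D)$.

```python
import numpy as np

def simple_rnn(x, kernel, recurrent_kernel, bias, return_sequences=False):
    """x has shape (N, T, V); returns (N, T, D) or (N, D)."""
    N, T, _ = x.shape
    h = np.zeros((N, kernel.shape[1]))  # zero initial state (assumption)
    outputs = []
    for t in range(T):
        h = np.tanh(x[:, t, :] @ kernel + h @ recurrent_kernel + bias)
        outputs.append(h)
    # Full sequence for a following recurrent layer, last state otherwise
    return np.stack(outputs, axis=1) if return_sequences else h

rng = np.random.default_rng(0)

def init(input_dim, units):
    """Random illustrative weights for one layer."""
    return (rng.normal(scale=0.1, size=(input_dim, units)),
            rng.normal(scale=0.1, size=(units, units)),
            np.zeros(units))

N, T, V = 32, 10, 8
x = rng.random((N, T, V))
seq = simple_rnn(x, *init(V, 5), return_sequences=True)  # feeds another recurrent layer
last = simple_rnn(seq, *init(5, 3))                      # feeds a fully connected layer
print(seq.shape, last.shape)  # (32, 10, 5) (32, 3)
```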

In [1]:
from sspipe import p, px
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers as layers
import tensorflow_hub as hub
import tensorflow_text as text

In [2]:
x = np.random.random((32,10,8))
y = np.random.random((32,))

In [3]:
model = keras.Sequential([
    layers.SimpleRNN(5, input_shape=(10,8), return_sequences=True),
    layers.SimpleRNN(3),
    layers.Dense(10)
])
model.compile(loss='mse', optimizer='adam')
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 10, 5)             70        
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 3)                 27        
                                                                 
 dense (Dense)               (None, 10)                40        
                                                                 
Total params: 137
Trainable params: 137
Non-trainable params: 0
_________________________________________________________________


In [35]:
model.weights

[<tf.Variable 'simple_rnn_15/simple_rnn_cell_15/kernel:0' shape=(8, 4) dtype=float32, numpy=
 array([[-0.12565374, -0.17764747,  0.19202441,  0.00790167],
        [ 0.42867404,  0.17833441, -0.6105001 ,  0.49296635],
        [-0.40300983,  0.3589633 , -0.26932156, -0.41699216],
        [-0.28938767,  0.16760606, -0.34825772,  0.5725295 ],
        [ 0.3798756 , -0.45982867,  0.30792862, -0.05597204],
        [ 0.39521652,  0.21937352,  0.3404147 ,  0.5420316 ],
        [ 0.37472934,  0.37950474, -0.04657042,  0.18962324],
        [ 0.37935108,  0.24689758, -0.4418146 , -0.25006822]],
       dtype=float32)>,
 <tf.Variable 'simple_rnn_15/simple_rnn_cell_15/recurrent_kernel:0' shape=(4, 4) dtype=float32, numpy=
 array([[-0.80420995, -0.04441664,  0.47590402,  0.35325453],
        [ 0.07364351, -0.4747402 ,  0.57299924, -0.6639806 ],
        [-0.29404065,  0.7626039 ,  0.0019677 , -0.57616967],
        [-0.51123667, -0.4371317 , -0.66722065, -0.31995237]],
       dtype=float32)>,
 <tf.Varia

In [4]:
_ = [print(weight.shape) for weight in model.weights]

(8, 5)
(5, 5)
(5,)
(5, 3)
(3, 3)
(3,)
(3, 10)
(10,)
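
These shapes account for the parameter counts reported by `model.summary()`. Note that Keras stores the input kernel with shape $(V\times D)$, the transpose of $\mathbf{W}_{hx}$, because it multiplies row vectors. The helper functions below are hypothetical names used only for this check.

```python
def simple_rnn_params(input_dim, units):
    # kernel (input_dim x units) + recurrent kernel (units x units) + bias (units)
    return units * (units + input_dim + 1)

def dense_params(input_dim, units):
    # weight matrix (input_dim x units) + bias (units)
    return input_dim * units + units

counts = [
    simple_rnn_params(8, 5),  # first SimpleRNN layer
    simple_rnn_params(5, 3),  # second SimpleRNN layer
    dense_params(3, 10),      # Dense layer
]
print(counts, sum(counts))  # [70, 27, 40] 137
```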


In [22]:
model.fit(x, y)



<keras.callbacks.History at 0x26f277c7ee0>

In [25]:
dfSpam = pd.read_csv('data/spam_message.csv')
# Shuffle and hold out a test set (the 80/20 ratio is an assumption)
dfTrain, dfTest = train_test_split(dfSpam, test_size=0.2, random_state=5)

In [34]:
dfSpam.sample(frac=1, random_state=5)

Unnamed: 0,spam,content
2095,0,"Probably, want to pick up more?"
5343,0,No go. No openings for that room 'til after th...
564,0,"Fuck babe ... I miss you already, you know ? C..."
3849,0,I to am looking forward to all the sex cuddlin...
3317,0,I'm freezing and craving ice. Fml
...,...,...
3046,0,"Ok. Not much to do here though. H&M Friday, ca..."
1725,0,You know there is. I shall speak to you in &l...
4079,0,"Sir, good morning. Hope you had a good weekend..."
2254,0,Ok. Me watching tv too.


In [26]:
dfSpam

Unnamed: 0,spam,content
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


In [24]:
bertProcessor = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
bertEncoder = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2')

In [20]:
doc = tf.constant([
    'this is such an amazing movie',
    'this movie is terrible',
])

# The preprocessing layer expects a string tensor rather than a Python list
doc = bertProcessor(doc)
embed = bertEncoder(doc)

In [7]:
embed['pooled_output'].shape

TensorShape([2, 128])

In [8]:
embed['sequence_output'].shape

TensorShape([2, 128, 128])


## 1.2. LSTM
[LSTM] (Long Short-Term Memory, 1997)

<img src='image/lstm_cell.png' style='height:320px; margin:20px auto;'>

<code style='font-size:13px'><a href=https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM>LSTM</a></code>

[LSTM]: https://en.wikipedia.org/wiki/Long_short-term_memory

<img src='image/lstm_steps.png' style='height:520px; margin:20px auto;'>

## 1.3. GRU
[Gated Recurrent Units] (GRU)

<img src='image/gru_cell.png' style='height:320px; margin:20px auto;'>

<code style='font-size:13px'><a href=https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU>GRU</a></code>

[Gated Recurrent Units]: https://en.wikipedia.org/wiki/Gated_recurrent_unit

## 1.4. Bi-directional

<img src='image/rnn_bidirectional.png' style='height:245px; margin:20px auto;'>

# 2. Recurrent architectures

## 2.1. Seq2seq
[Seq2seq]

[Seq2seq]: https://en.wikipedia.org/wiki/Seq2seq

## 2.2. Attention
[Attention] is implemented in TensorFlow via the class
<code style='font-size:13px'><a href=https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention>Attention</a></code>.

[Attention]: https://en.wikipedia.org/wiki/Attention_(machine_learning)

## 2.3. Transformer
[Transformer]

[Transformer]: https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
