## Clarification of A Recurrent Neuron and A Layer of Recurrent Neurons
Figures `15-1` and `15-2` indicate quite precisely what the dimension of each quantity is. A symbol that is **not boldface**
denotes a **scalar** (like the $y, y_{(t-3)}\,, y_{(t-2)}\,, y_{(t-1)}\,, y_{(t)}$ in figure `15-1`), and a symbol in **boldface**
denotes a **vector** (like the $\mathbf{y}, \mathbf{y}_{(0)}\,, \mathbf{y}_{(1)}\,, \mathbf{y}_{(2)}$ in figure `15-2`).
![](./figs/fig.15-1.png)
![](./figs/fig.15-2.png)


More explicitly speaking,

- a (recurrent) neuron's output is always a **scalar**
- a layer of recurrent neurons is a group of recurrent neurons working together, and its output is a **vector** whose dimension equals the number of neurons in the layer

## Recurrent Neuron
We have

- $y_{(t)} \in \mathbb{R}$ for all time $t$
- $\mathbf{x}_{(t)} \in \mathbb{R}^{n_{\,\text{inputs}}}\;\;$ for all time $t$, where ${n_{\,\text{inputs}}}$ stands for the number of input neurons
- A single neuron's parameters are vectors $\mathbf{w_x} \in \mathbb{R}^{n_{\,\text{inputs}}}\;\;$ and scalars $w_y \in \mathbb{R}, b \in \mathbb{R}$
- An activation function $\phi$
- The formula connecting all these together is $$y_{(t)} = \phi\left(\mathbf{w_x} \cdot \mathbf{x}_{(t)} + w_y y_{(t-1)} + b\right)$$
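
The recurrence above can be sketched in a few lines of NumPy. This is only an illustration: the function name `recurrent_neuron`, the choice of `tanh` for $\phi$, and the zero initial output are assumptions made for the sketch, not something fixed by the formula.

```python
import numpy as np

def recurrent_neuron(xs, w_x, w_y, b, phi=np.tanh, y_init=0.0):
    """Run a single recurrent neuron over a sequence.

    xs  : array of shape (T, n_inputs), one input vector per time step
    w_x : array of shape (n_inputs,)
    w_y : scalar weight on the previous output
    b   : scalar bias
    """
    y = y_init  # assumed value of the output before the first time step
    outputs = []
    for x_t in xs:
        # y_(t) = phi(w_x . x_(t) + w_y * y_(t-1) + b)
        y = phi(np.dot(w_x, x_t) + w_y * y + b)
        outputs.append(y)
    return np.array(outputs)

rng = np.random.default_rng(0)
xs = rng.normal(size=(4, 3))   # 4 time steps, n_inputs = 3
w_x = rng.normal(size=3)
ys = recurrent_neuron(xs, w_x, w_y=0.5, b=0.1)
print(ys.shape)  # (4,)
```

Each entry of `ys` is a scalar, one per time step, which matches the non-boldface $y_{(t)}$ of figure `15-1`.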

## A Layer of Recurrent Neurons (To be edited!!!)
We have

For a single sequence,

- $\mathbf{y}_{(t)} \in \mathbb{R}^{n_{\,\text{outputs}}}\;\;\;$ for all time $t$, where ${n_{\,\text{outputs}}}$ stands for the number of output neurons
  - Initially, $\mathbf{y}_{(0)} = \mathbf{0}$
  - For convenience, we denote $k = n_{\,\text{outputs}}$ so that $\mathbf{y}_{(t)} \in \mathbb{R}^k$
- $\mathbf{x}_{(t)} \in \mathbb{R}^{n_{\,\text{inputs}}}\;\;$ for all time $t$
  - For convenience, we denote $n = n_{\,\text{inputs}}$ so that $\mathbf{x}_{(t)} \in \mathbb{R}^n$
- Trainable parameters: Matrices $\hat{W}_{\mathbf{x}} \in M_{k \times n}, \hat{W}_{\mathbf{y}} \in M_{k \times k}\;\;$ and vector $\hat{\mathbf{b}} \in \mathbb{R}^k$
  - The reason for putting the hats will be made clear below
- An activation function $\phi$
- The formula connecting all these together is $$\mathbf{y}_{(t)} = \phi\left(\hat{W}_{\mathbf{x}}\, \mathbf{x}_{(t)} + \hat{W}_{\mathbf{y}}\, \mathbf{y}_{(t-1)} + \hat{\mathbf{b}}\right)$$
  - The activation $\phi$ simply acts component by component
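
The single-sequence case can be sketched as follows. The function name `rnn_layer_single` and the use of `tanh` are assumptions for illustration, and the zero vector is taken as the output that precedes the first input, which is one reading of the initial condition above.

```python
import numpy as np

def rnn_layer_single(xs, W_x_hat, W_y_hat, b_hat, phi=np.tanh):
    """Run a layer of k recurrent neurons over one sequence.

    xs      : array of shape (T, n)
    W_x_hat : array of shape (k, n)
    W_y_hat : array of shape (k, k)
    b_hat   : array of shape (k,)
    Returns the outputs stacked into an array of shape (T, k).
    """
    y = np.zeros(b_hat.shape[0])  # assumed initial output: the zero vector
    outputs = []
    for x_t in xs:
        # phi acts component by component on the k-dimensional pre-activation
        y = phi(W_x_hat @ x_t + W_y_hat @ y + b_hat)
        outputs.append(y)
    return np.stack(outputs)

n, k, T = 3, 2, 4
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, n))
W_x_hat = rng.normal(size=(k, n))
W_y_hat = rng.normal(size=(k, k))
b_hat = np.zeros(k)
ys = rnn_layer_single(xs, W_x_hat, W_y_hat, b_hat)
print(ys.shape)  # (4, 2)
```

Here every time step produces a vector in $\mathbb{R}^k$, in contrast to the scalar output of a single neuron.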

For a batch of sequences,

- Let $\beta$ be the batch size
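- $\mathbf{y}_{(t)}^{(j)} \in \mathbb{R}^{k}$ for all time $t$ and for every batch instance $j$
  - Stacking these as rows gives $Y_{(t)} \in M_{\beta \times k}$, whose $j$-th row is $\left(\mathbf{y}_{(t)}^{(j)}\right)^{T}$
- $\mathbf{x}_{(t)}^{(j)} \in \mathbb{R}^{n}$ for all time $t$ and for every batch instance $j$
  - Stacking these as rows gives $X_{(t)} \in M_{\beta \times n}$, whose $j$-th row is $\left(\mathbf{x}_{(t)}^{(j)}\right)^{T}$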
- Trainable parameters: Matrices $W_{\mathbf{x}} := \hat{W}_{\mathbf{x}}^{T}, W_{\mathbf{y}} := \hat{W}_{\mathbf{y}}^{T}$ and vector $\mathbf{b} := \hat{\mathbf{b}}^{T}$
  - This is why we put hats on the variables earlier: the unhatted symbols are reserved for the transposed parameters, which multiply the batch matrices from the right
- An activation function $\phi$
- The formula connecting all these together is $$Y_{(t)}= \phi\left(X_{(t)} W_{\mathbf{x}} + Y_{(t-1)} W_{\mathbf{y}} + \mathbf{b}\right)$$
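
A quick numerical check of the relation between the two formulations, as a sketch under stated assumptions: the batch matrices stack one sequence per row (shapes $\beta \times n$ and $\beta \times k$, which is what the product $X_{(t)} W_{\mathbf{x}}$ requires), $\phi$ is taken to be `tanh`, and the bias is added to every row by broadcasting. All variable names are illustrative.

```python
import numpy as np

beta, n, k = 4, 3, 2
rng = np.random.default_rng(0)

# Hatted parameters, as in the single-sequence formula
W_x_hat = rng.normal(size=(k, n))
W_y_hat = rng.normal(size=(k, k))
b_hat = rng.normal(size=k)

# Unhatted parameters are the transposes; a 1-D NumPy array has no
# row/column distinction, so b is the same array as b_hat.
W_x, W_y, b = W_x_hat.T, W_y_hat.T, b_hat

X_t = rng.normal(size=(beta, n))      # one input vector per row
Y_prev = rng.normal(size=(beta, k))   # previous outputs, one per row

# Batch formula: b is broadcast over the beta rows
Y_t = np.tanh(X_t @ W_x + Y_prev @ W_y + b)

# Single-sequence formula applied to each batch instance separately
Y_rows = np.stack([
    np.tanh(W_x_hat @ X_t[j] + W_y_hat @ Y_prev[j] + b_hat)
    for j in range(beta)
])

print(Y_t.shape)                  # (4, 2)
print(np.allclose(Y_t, Y_rows))   # True
```

The agreement is just the identity $(\hat{W}\mathbf{v})^{T} = \mathbf{v}^{T}\hat{W}^{T}$ applied row by row.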



### Number of Trainable Parameters
If our understanding above is correct, then with
$$W_{\mathbf{x}} \in M_{n \times k}, W_{\mathbf{y}} \in M_{k \times k}, \mathbf{b} \in \mathbb{R}^k$$
`(# params) = n*k + k*k + k`.
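
As a concrete instance, taking $n = 11$ and $k = 5$ (the values used in the code cells of this section), the formula gives
$$n k + k^2 + k = 11 \cdot 5 + 5 \cdot 5 + 5 = 55 + 25 + 5 = 85,$$
so 85 is the count one would expect Keras to report for such a layer, provided the reasoning above is right.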

Let's verify whether this is true in `keras`.

In [2]:
import tensorflow.keras as keras


In [3]:
n = 11
k = 5
simpleRNN_layer = keras.layers.SimpleRNN(k, input_shape=(None, n))

In [4]:
[s for s in dir(simpleRNN_layer) if not s.startswith("_")]

['activation',
 'activity_regularizer',
 'add_loss',
 'add_metric',
 'add_update',
 'add_variable',
 'add_weight',
 'apply',
 'bias_constraint',
 'bias_initializer',
 'bias_regularizer',
 'build',
 'built',
 'call',
 'cell',
 'compute_dtype',
 'compute_mask',
 'compute_output_shape',
 'compute_output_signature',
 'constants_spec',
 'count_params',
 'dropout',
 'dtype',
 'dtype_policy',
 'dynamic',
 'finalize_state',
 'from_config',
 'get_config',
 'get_initial_state',
 'get_input_at',
 'get_input_mask_at',
 'get_input_shape_at',
 'get_losses_for',
 'get_output_at',
 'get_output_mask_at',
 'get_output_shape_at',
 'get_updates_for',
 'get_weights',
 'go_backwards',
 'inbound_nodes',
 'input',
 'input_mask',
 'input_shape',
 'input_spec',
 'kernel_constraint',
 'kernel_initializer',
 'kernel_regularizer',
 'losses',
 'metrics',
 'name',
 'name_scope',
 'non_trainable_variables',
 'non_trainable_weights',
 'outbound_nodes',
 'output',
 'output_mask',
 'output_shape',
 'recurrent_constraint

In [None]:
simpleRNN_layer.trainable_weights

In [6]:
simpleRNN_layer.trainable

True

In [7]:
simpleRNN_layer.trainable_variables

[]

In [8]:
simpleRNN_layer.variables

[]

In [9]:
simpleRNN_layer.weights

[]

In [11]:
simpleRNN_layer.get_config()

{'name': 'simple_rnn',
 'trainable': True,
 'batch_input_shape': (None, None, 11),
 'dtype': 'float32',
 'return_sequences': False,
 'return_state': False,
 'go_backwards': False,
 'stateful': False,
 'unroll': False,
 'time_major': False,
 'units': 5,
 'activation': 'tanh',
 'use_bias': True,
 'kernel_initializer': {'class_name': 'GlorotUniform',
  'config': {'seed': None}},
 'recurrent_initializer': {'class_name': 'Orthogonal',
  'config': {'gain': 1.0, 'seed': None}},
 'bias_initializer': {'class_name': 'Zeros', 'config': {}},
 'kernel_regularizer': None,
 'recurrent_regularizer': None,
 'bias_regularizer': None,
 'activity_regularizer': None,
 'kernel_constraint': None,
 'recurrent_constraint': None,
 'bias_constraint': None,
 'dropout': 0.0,
 'recurrent_dropout': 0.0}

In [12]:
simpleRNN_layer.get_weights()

[]

It seems that we can only get the trainable parameters after including the layer in a model.

In [14]:
model = keras.Sequential([simpleRNN_layer])


In [21]:
# This gives pretty much the same attributes as above
#[s for s in dir(model.layers[0]) if not s.startswith("_")]

The attributes

- `weights`
- `trainable_weights`
- `variables`
- `trainable_variables`

all give the same list `[W_x, W_y, b]`.

In [17]:
model.layers[0].weights

[<tf.Variable 'simple_rnn/simple_rnn_cell/kernel:0' shape=(11, 5) dtype=float32, numpy=
 array([[-0.39205524,  0.6056785 ,  0.00976646,  0.2922976 , -0.54178315],
        [ 0.58913845, -0.40885615, -0.06774998,  0.17361706,  0.06441653],
        [-0.39094067, -0.11464536, -0.39618343, -0.2597798 , -0.42601216],
        [-0.48977983,  0.5518094 , -0.54704314,  0.32235378, -0.48730862],
        [-0.6051304 ,  0.36912167, -0.3797627 , -0.05026609,  0.0447852 ],
        [ 0.12057722, -0.4491092 ,  0.5895316 , -0.4219625 ,  0.06617713],
        [-0.6115918 ,  0.04476041,  0.38420504, -0.16691211, -0.04666221],
        [-0.25456628, -0.1030075 ,  0.5951409 ,  0.1744259 ,  0.4264763 ],
        [ 0.05671275,  0.43728095,  0.35596383,  0.50579876, -0.33917367],
        [-0.35362354, -0.2584312 ,  0.45808452, -0.12331593,  0.0307436 ],
        [-0.50871116, -0.11249492,  0.36809516,  0.2002685 , -0.3210538 ]],
       dtype=float32)>,
 <tf.Variable 'simple_rnn/simple_rnn_cell/recurrent_kernel:0' 

The method `get_weights()` gives pretty much the same list but with members as `np.ndarray` rather than `tf.Variable`

In [20]:
model.layers[0].get_weights()

[array([[-0.39205524,  0.6056785 ,  0.00976646,  0.2922976 , -0.54178315],
        [ 0.58913845, -0.40885615, -0.06774998,  0.17361706,  0.06441653],
        [-0.39094067, -0.11464536, -0.39618343, -0.2597798 , -0.42601216],
        [-0.48977983,  0.5518094 , -0.54704314,  0.32235378, -0.48730862],
        [-0.6051304 ,  0.36912167, -0.3797627 , -0.05026609,  0.0447852 ],
        [ 0.12057722, -0.4491092 ,  0.5895316 , -0.4219625 ,  0.06617713],
        [-0.6115918 ,  0.04476041,  0.38420504, -0.16691211, -0.04666221],
        [-0.25456628, -0.1030075 ,  0.5951409 ,  0.1744259 ,  0.4264763 ],
        [ 0.05671275,  0.43728095,  0.35596383,  0.50579876, -0.33917367],
        [-0.35362354, -0.2584312 ,  0.45808452, -0.12331593,  0.0307436 ],
        [-0.50871116, -0.11249492,  0.36809516,  0.2002685 , -0.3210538 ]],
       dtype=float32),
 array([[-0.21763706,  0.06546668,  0.7511529 ,  0.619652  , -0.01220071],
        [-0.36351594, -0.8713754 , -0.05070263,  0.01944964, -0.32498005],
 

There are also `non_trainable_variables` and `non_trainable_weights` attributes, but in this case they are empty.

In [23]:
model.layers[0].non_trainable_variables

[]

In [24]:
model.layers[0].non_trainable_weights

[]

**(?)** Why must we put the layer into a model before being able to inspect its weights?
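
A likely answer, which the notebook above does not state: Keras layers create their weights lazily, when the layer is *built*, because the weight shapes depend on the input shape. Adding the layer to a model whose input shape is known is one way to trigger the build, but it does not appear to be the only one. The sketch below calls `build` directly on a fresh layer (the variable name `layer` and the explicit shape tuple are choices made for this example) and also completes the parameter-count check.

```python
from tensorflow import keras

n, k = 11, 5
layer = keras.layers.SimpleRNN(k)

# No weights exist before the layer is built
print(len(layer.trainable_weights))  # 0

# Build explicitly; the shape is (batch size, time steps, features),
# with None meaning "unspecified"
layer.build((None, None, n))

shapes = [tuple(w.shape) for w in layer.trainable_weights]
print(shapes)                                      # [(11, 5), (5, 5), (5,)]
print(layer.count_params() == n * k + k * k + k)   # True, i.e. 85 parameters
```

Calling the layer on an actual input tensor should build it in the same way. The three shapes correspond to $W_{\mathbf{x}} \in M_{n \times k}$, $W_{\mathbf{y}} \in M_{k \times k}$ and $\mathbf{b} \in \mathbb{R}^k$.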

## Memory Cells
In a more sophisticated setting, there is also something called a **hidden state**, usually denoted $\mathbf{h}_{(t)}\,.$
The common practice is to

- let $\mathbf{h}_{(t)} = f(\mathbf{h}_{(t-1)}\,, \mathbf{x}_{(t)}\,)$ for some function $f$
- let $\mathbf{y}_{(t)} = g(\mathbf{h}_{(t-1)}\,, \mathbf{x}_{(t)}\,)$ for some function $g$.

In what we discussed above (the simplest case), the output $\mathbf{y}_{(t)}$ itself plays the role of the hidden state, so no separate $\mathbf{h}_{(t)}$ was needed. Further on in this chapter, we will encounter more sophisticated RNNs that do make use of a distinct hidden state.
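
The state/output split can be sketched generically. The helper `run_cell` and the particular choice of `f` below are hypothetical, introduced only to illustrate the two update rules; choosing `g = f` recovers the simplest case, in which the output and the hidden state coincide.

```python
import numpy as np

def run_cell(xs, h_init, f, g):
    """Unroll a memory cell over a sequence xs of shape (T, n).

    f(h_prev, x_t) returns the new hidden state h_(t);
    g(h_prev, x_t) returns the output y_(t).
    """
    h = h_init
    ys = []
    for x_t in xs:
        y = g(h, x_t)  # output from the previous state and the current input
        h = f(h, x_t)  # state update from the same two arguments
        ys.append(y)
    return np.stack(ys), h

rng = np.random.default_rng(0)
n, k, T = 3, 2, 5
W_x = rng.normal(size=(k, n))
W_h = rng.normal(size=(k, k))

# Hypothetical state update, chosen to mirror the basic recurrent layer
f = lambda h, x: np.tanh(W_x @ x + W_h @ h)

# With g = f, the output at each step equals the new hidden state
ys, h_last = run_cell(rng.normal(size=(T, n)), np.zeros(k), f, g=f)
print(np.allclose(ys[-1], h_last))  # True
```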