## <span style="color:orange">Recurrent Models with FluxML</span>

#### <span style="color:orange"> Flux offers a set of recurrent layers out of the box. Recurrent models can come across as more complicated than they really are. In general we are still dealing with the same type of functional relationship, $\hat{y} = f(X_i) = f_{rnn}(X_t)$. Previously $\hat{y}$ had a single dimension or multiple dimensions; here $\hat{y} = [y_t , h_t]$, where $h_t$ is an input into the next time point (the memory that is carried on), so that we have $X_t = [ x_t , h_{t-1}]$. The dependency can be written as $\hat{y}_t = f(x_t,h_{t-1}), h_t = g(h_{t-1},x_t)$ </span>

- This basic recurrence relationship says that at each time point $t$ we take $h_{t-1}$ from the previous time step $t-1$, combine it with the current inputs $x_t$, and produce an output with 2 components, $y_t$ and $h_t$ (where $h_t$ feeds into $t+1$). If we focus on a single time point, we are still doing the same functional mapping that we had before with the chain of Dense layers.

- The image below ([source](https://medium.com/swlh/introduction-to-recurrent-neural-networks-rnns-347903dd8d81)) shows how this looks

![rnn](./rnn1.png)

- The image below ([source](https://www.ibm.com/topics/recurrent-neural-networks)) gives another view of the same structure

![rnn](./rnn2.jpeg)

----------------

#### <span style="color:orange"> $W_h$ is the weight matrix (transformation) applied to the 'hidden inputs' $h_{t-1}$ that come from the previous unit's 'hidden' output, and $W_x$ is the weight matrix (transformation) applied to the inputs at the current time, $x_t$. $W_y$ is the weight matrix applied to the 'current' hidden value $h_t$ produced by the cell; after a non-linear transformation (activation function) this gives the output $\hat{y}_t$. Training involves learning the values of the weights/parameters of these matrices.</span>

### <span style="color:orange"> $h_t = \tanh(b_h + W_h^t h_{t-1} + W_x^t x_t)$ </span>
### <span style="color:orange"> $\hat{y}_t = \mathrm{softmax}(b_y + W_{y}^{t} h_t)$ </span>
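To make the two equations concrete, here is a minimal hand-written sketch in plain Julia (no Flux). It assumes the superscript on the weight matrices denotes a transpose, so the matrices are stored with the input dimension first. The names `softmax_vec`, `rnn_step` and `run_sequence` and the chosen dimensions are illustrative only and are not part of Flux.

```julia
# Hand-written RNN step following the equations above (illustrative sketch, not Flux API).
# Assumption: the superscript t on W denotes a transpose, so W_x is stored as x_dim × h_dim.

# Numerically stable softmax for a vector
softmax_vec(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))

# One time step: returns the prediction y_t and the new hidden state h_t
function rnn_step(p, h_prev, x)
    h = tanh.(p.b_h .+ p.W_h' * h_prev .+ p.W_x' * x)
    y = softmax_vec(p.b_y .+ p.W_y' * h)
    return y, h
end

# Apply the step along a sequence, carrying h forward as the memory
function run_sequence(p, xs, h0)
    h = h0
    ys = Vector{Vector{Float64}}()
    for x in xs
        y, h = rnn_step(p, h, x)
        push!(ys, y)
    end
    return ys, h
end

x_dim, h_dim, y_dim = 2, 5, 3   # arbitrary sizes chosen for the example
p = (W_h = 0.1 .* randn(h_dim, h_dim),
     W_x = 0.1 .* randn(x_dim, h_dim),
     b_h = zeros(h_dim),
     W_y = 0.1 .* randn(h_dim, y_dim),
     b_y = zeros(y_dim))

xs = [rand(x_dim) for _ in 1:4]                  # a sequence of 4 inputs
ys, h_final = run_sequence(p, xs, zeros(h_dim))  # start from a zero hidden state
println(ys[end])
```

Each element of `ys` is a probability vector because of the softmax, and `h_final` is the memory that would be passed on to a fifth time step.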

In [1]:
import Pkg
Pkg.status("Flux")

Status `~/.julia/environments/v1.8/Project.toml`
⌃ [587475ba] Flux v0.13.10
Info Packages marked with ⌃ have new versions available and may be upgradable.


In [2]:
using Flux
using Zygote
using Plots

In [29]:
#example of usage of the RNN unit
h_dim = 5 #hidden dimension
x_dim = 2 #input dimension at time t
y_dim = 1 #the output dimension that 'we' see

rnn_tmp = RNN( x_dim => h_dim ) #produces the recurrent layer (a Recur wrapping an RNNCell)
x_t1 = Float32.( [1,2] ) #some arbitrary input data
println( rnn_tmp( x_t1 ) ) #print the output h_t1 from the cell

Float32[0.024841143, -0.7917899, 0.673808, -0.89635473, 0.5669292]


### <span style="color:orange"> It is key to know that this layer differs from the Dense layers in that it is **stateful**: it maintains its hidden state between calls. In this version of Flux the state is held in the `state` field of the `Recur` wrapper around the cell, which is updated every time the layer is called</span>

In [28]:
x_t1 = Float32.( [1,2] ) #some arbitrary input data
println( rnn_tmp( x_t1 ) ) #print the output h_t1 from the cell
#print multiple times to see the changes
display( [ rnn_tmp( x_t1 ) for _ in 1:4 ] )

Float32[-0.49927357, 0.71427214, 0.9833088, 0.9815449, -0.9947405]


4-element Vector{Vector{Float32}}:
 [-0.5368339, 0.7413628, 0.98381394, 0.98049414, -0.9953573]
 [-0.55849344, 0.7566783, 0.98412806, 0.9798016, -0.99569243]
 [-0.57055, 0.76511675, 0.9843097, 0.9793839, -0.995873]
 [-0.57712597, 0.76969594, 0.9844107, 0.9791454, -0.9959696]
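Because the hidden state persists across calls, it usually needs to be cleared before a new, unrelated sequence is fed in. The sketch below assumes the Flux v0.13 API shown earlier (where `RNN` returns a stateful `Recur`) and uses `Flux.reset!` to restore the initial state; the variable names are illustrative.

```julia
using Flux

# Assumes Flux v0.13, where RNN(...) builds a stateful Recur(RNNCell(...))
rnn = RNN(2 => 5)
x = Float32[1, 2]

h_first  = rnn(x)        # computed from the initial hidden state
h_second = rnn(x)        # same input, but the stored state has moved on

Flux.reset!(rnn)         # restore the initial hidden state
h_after_reset = rnn(x)   # computed from the initial hidden state again

println(h_first ≈ h_second)       # expected to be false for randomly initialised weights
println(h_first ≈ h_after_reset)  # the reset reproduces the first output
```

Calling `Flux.reset!` at the start of every sequence keeps the output for one sequence from depending on whatever was processed before it.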

### <span style="color:orange"> Since the RNN unit maintains and handles the state between subsequent calls, we can use it in the ML pipeline in the same way we used the Dense layer. Notice that above we produced hidden representations from the inputs, but not the predictions $\hat{y}$, since those are computed separately. </span>

### <span style="color:orange"> The RNN layer implements $h_t = \tanh(b_h + W_h^t h_{t-1} + W_x^t x_t)$, but it does not implement $\hat{y}_t = \mathrm{softmax}(b_y + W_{y}^{t} h_t)$. The $W_{y}$ matrix is not provided by the RNN layer and must be supplied by the user. </span>

In [31]:
#we feed the model with 'x_dim' data, it produces a hidden vector 'h_dim' and outputs a 'y_dim' vector at each time
rnn_model1 = Chain( RNN( x_dim => h_dim ) , Dense( h_dim => y_dim ) ) 

Chain(
  Recur(
    RNNCell(2 => 5, tanh),              # 45 parameters
  ),
  Dense(5 => 1),                        # 6 parameters
)         # Total: 6 trainable arrays, 51 parameters,
          # plus 1 non-trainable, 5 parameters, summarysize 580 bytes.

In [33]:
#try out the model
rnn_model1( x_t1 )

1-element Vector{Float32}:
 0.51586723

In [36]:
[ rnn_model1( x_t1 ) for _ in 1:4 ]

4-element Vector{Vector{Float32}}:
 [0.6140427]
 [0.61374307]
 [0.61424994]
 [0.61421525]

In [37]:
x_length = 5
x_seq = [ rand( Float32 , x_dim ) for i = 1:x_length ]
[ rnn_model1( xt ) for xt in x_seq ]

5-element Vector{Vector{Float32}}:
 [-0.07165283]
 [-0.076300934]
 [0.18530461]
 [0.023417383]
 [-0.023466706]
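The model above ends in a plain `Dense` layer, so its output is not passed through the softmax that appears in the equation for $\hat{y}_t$. For a classification setting, one possible variant appends `softmax` to the chain. This is a sketch assuming Flux v0.13; `rnn_classifier` and `n_classes` are names chosen for this example.

```julia
using Flux

x_dim, h_dim, n_classes = 2, 5, 3   # n_classes is an arbitrary choice for the example

# RNN layer for h_t, Dense layer playing the role of W_y and b_y, then softmax
rnn_classifier = Chain(RNN(x_dim => h_dim), Dense(h_dim => n_classes), softmax)

probs = rnn_classifier(Float32[1, 2])
println(probs)        # one probability per class
println(sum(probs))   # approximately 1
```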

In [None]:
### <span style="color:orange">Write a function, then make a function that computes its gradient</span>
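One possible reading of this step is a loss over a whole sequence together with a function that returns its gradient. The sketch below assumes Flux v0.13 and its implicit-parameter style (`Flux.params`). The names `seq_loss` and `seq_loss_gradient`, the random targets `y_seq`, and the choice of mean squared error are assumptions made for illustration.

```julia
using Flux

x_dim, h_dim, y_dim = 2, 5, 1
model = Chain(RNN(x_dim => h_dim), Dense(h_dim => y_dim))

# Toy data: a sequence of 5 inputs with 5 arbitrary targets (illustrative only)
x_seq = [rand(Float32, x_dim) for _ in 1:5]
y_seq = [rand(Float32, y_dim) for _ in 1:5]

# The function: summed squared-error loss over the sequence.
# The hidden state is carried from one time step to the next by the model itself.
seq_loss(xs, ys) = sum(Flux.mse(model(x), y) for (x, y) in zip(xs, ys))

# The gradient function: clears the state, then differentiates the loss
# with respect to the model parameters (implicit-parameter style)
function seq_loss_gradient(xs, ys)
    Flux.reset!(model)
    return Flux.gradient(() -> seq_loss(xs, ys), Flux.params(model))
end

grads = seq_loss_gradient(x_seq, y_seq)
println(size(grads[model[2].weight]))   # same shape as the Dense weight matrix
```

The returned `grads` object is indexed by the parameter arrays themselves, which is the form the Flux v0.13 optimisers expect in `Flux.Optimise.update!`.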