## <span style="color:orange">Recurrent Models with FluxML</span>

#### <span style="color:orange"> Flux offers a set of recurrent layers out of the box. Recurrent models can come across as more complicated than they really are. In general we are still dealing with the same type of functional relationship, $\hat{y} = f(X_i) = f_{rnn}(X_t)$. Previously $\hat{y}$ had one or more dimensions; here $\hat{y} = [y_t , h_t]$, where $h_t$ is carried forward as an input to the next time point (the memory), so that we have $X_t = [ x_t , h_{t-1}]$. The dependency can be written as $\hat{y}_t = f(x_t,h_{t-1}), h_t = g(h_{t-1},x_t)$ </span>

- This basic recurrence relationship says that at each time step $t$ we take $h_{t-1}$ from the previous time step $t-1$, combine it with the current inputs $x_t$, and produce an output with 2 components, $y_t$ and $h_t$ (where $h_t$ feeds into $t+1$). If we focus on a single time point, we are still doing the same functional mapping that we had before with the chain of Dense layers.

- The below image [link source](https://www.google.com/url?sa=i&url=https%3A%2F%2Fmedium.com%2Fswlh%2Fintroduction-to-recurrent-neural-networks-rnns-347903dd8d81&psig=AOvVaw3xmMdMdDizNUUWXy021QUO&ust=1673451409556000&source=images&cd=vfe&ved=0CBAQjhxqFwoTCOCStLiqvfwCFQAAAAAdAAAAABA_) shows how this looks in a model diagram for **seq-to-seq** (**many to many**)

![rnn](./rnn1.png)

- Also, the below image [link source](https://www.ibm.com/topics/recurrent-neural-networks) shows how this looks

![rnn](./rnn2.jpeg)

----------------

#### <span style="color:orange"> $W_h$ is the weight matrix (transformation) applied to the 'hidden inputs' $h_{t-1}$ that come from the previous unit's 'hidden' output, and $W_x$ is the weight matrix (transformation) applied to the inputs at the current time, $x_t$. $W_y$ is the weight matrix applied to the 'current' hidden value $h_t$ produced by the cell; after a non-linear transformation (activation function) this gives the output $\hat{y}_t$. Training involves learning the values of the weights/parameters in these matrices.</span>

### <span style="color:orange"> $h_t = \tanh(b_h + W_h^t h_{t-1} + W_x^t x_t)$ </span>
### <span style="color:orange"> $\hat{y}_t = \mathrm{softmax}(b_y + W_{y}^{t} h_t)$ </span>
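
The two equations above can be written out by hand in plain Julia, without Flux, to make the data flow explicit. This is only a sketch under stated assumptions: the superscript on each weight matrix is read as a transpose, the weights are random rather than trained, and every name (`softmax_vec`, `rnn_step`, `run_sequence`, `weights`, `nh`, `nx`, `ny`) is made up for illustration.

```julia
# Plain-Julia sketch of one RNN step; no Flux required.
# Every name below is illustrative, and W' is taken to mean "transpose".

# numerically stable softmax over a vector
function softmax_vec(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# one time step: returns the visible output y_t and the new hidden state h_t
function rnn_step(p, h_prev, x_t)
    h_t = tanh.(p.b_h .+ p.W_h' * h_prev .+ p.W_x' * x_t)
    y_t = softmax_vec(p.b_y .+ p.W_y' * h_t)
    return y_t, h_t
end

# run a whole sequence, feeding each h_t into the next step
function run_sequence(p, x_seq, h0)
    h = h0
    ys = Vector{Vector{Float64}}()
    for x_t in x_seq
        y_t, h = rnn_step(p, h, x_t)
        push!(ys, y_t)
    end
    return ys, h
end

nh, nx, ny = 5, 2, 3   # hidden, input and output sizes (arbitrary choices)
weights = (W_h = 0.1 .* randn(nh, nh),
           W_x = 0.1 .* randn(nx, nh),
           W_y = 0.1 .* randn(nh, ny),
           b_h = zeros(nh),
           b_y = zeros(ny))

x_seq_toy = [rand(nx) for _ in 1:4]            # 4 time steps of fake input
ys, h_last = run_sequence(weights, x_seq_toy, zeros(nh))
```

Each element of `ys` is a probability vector (it sums to 1 because of the softmax), and `h_last` is the hidden state that would be passed on to a fifth time step.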

In [12]:
import Pkg
Pkg.status("Flux")

[32m[1mStatus[22m[39m `~/.julia/environments/v1.8/Project.toml`
 [90m [587475ba] [39mFlux v0.13.11


In [13]:
using Flux
using Zygote
using Plots

In [14]:
#example of usage of the RNN unit
h_dim = 5 #hidden dimension
x_dim = 2 #input dimension at time t
y_dim = 1 #the output dimension that 'we' see

rnn_tmp = RNN( x_dim => h_dim ) #produces the recurrent layer (a Recur wrapping an RNNCell)
x_t1 = Float32.( [1,2] ) #some arbitrary input data
println( rnn_tmp( x_t1 ) ) #print the output h_t1 from the cell

Float32[-0.9805636, 0.9029308, -0.021556435, -0.3254551, 0.8538051]


### <span style="color:orange"> It is key to know that this layer is different from the Dense layers in that it is **stateful**: it maintains its hidden state between executions. The state is held in the `Recur` wrapper around the cell, so calling the layer again with the same input produces a different output</span>

In [15]:
x_t1 = Float32.( [1,2] ) #some arbitrary input data
println( rnn_tmp( x_t1 ) ) #print the output h_t1 from the cell
#print multiple times to see the changes
display( [ rnn_tmp( x_t1 ) for _ in 1:4 ] )

Float32[-0.9032693, 0.8045727, -0.42605814, -0.8635573, 0.25629073]


4-element Vector{Vector{Float32}}:
 [-0.9606528, 0.8979496, -0.11266329, -0.7910778, 0.6891953]
 [-0.9167843, 0.8792115, -0.43906084, -0.8651189, 0.39828205]
 [-0.95545655, 0.8977845, -0.17973875, -0.80005825, 0.62877864]
 [-0.9265469, 0.8820714, -0.38817954, -0.8545371, 0.4493794]
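
To see why the same input gives a different output on every call, it can help to mimic the statefulness with a small callable struct in plain Julia. This is a sketch only, not Flux's own code: `ToyRecur` and its fields are hypothetical names, and the weights are random.

```julia
# Hypothetical stand-in for a stateful recurrent layer; not Flux's implementation.
mutable struct ToyRecur
    W_h::Matrix{Float64}
    W_x::Matrix{Float64}
    b::Vector{Float64}
    state::Vector{Float64}   # plays the role of h_{t-1}; overwritten on every call
end

# convenience constructor: random weights, zero bias, zero initial state
ToyRecur(x_dim::Int, h_dim::Int) =
    ToyRecur(0.5 .* randn(h_dim, h_dim), 0.5 .* randn(h_dim, x_dim),
             zeros(h_dim), zeros(h_dim))

# calling the object computes h_t from (h_{t-1}, x_t) and stores it as the new state
function (m::ToyRecur)(x)
    m.state = tanh.(m.W_h * m.state .+ m.W_x * x .+ m.b)
    return m.state
end

toy = ToyRecur(2, 5)
x_toy = [1.0, 2.0]
h1 = toy(x_toy)   # computed from the zero initial state
h2 = toy(x_toy)   # same input, but now computed from h1
```

The second call starts from `h1` instead of zeros, which is why `h1` and `h2` differ even though the input is identical.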

### <span style="color:orange"> Since the RNN unit maintains and handles the state between subsequent uses, we can use it in the ML pipeline in the same abstract way as we used the Dense layer. Notice that above we produced hidden representations $h_t$ from the inputs, but not the predictions $\hat{y}$, since those are computed separately. </span>

### <span style="color:orange"> The RNN layer implements $h_t = \tanh(b_h + W_h^t h_{t-1} + W_x^t x_t)$ but it does not implement $\hat{y}_t = \mathrm{softmax}(b_y + W_{y}^{t} h_t)$. The $W_{y}$ matrix is not provided by the RNN layer and must be supplied by the user, for example as a Dense layer placed after it. </span>

In [16]:
#we feed the model with 'x_dim' data, it produces a hidden vector 'h_dim' and outputs a 'y_dim' vector at each time
rnn_model1 = Chain( RNN( x_dim => h_dim ) , Dense( h_dim => y_dim ) ) 

Chain(
  Recur(
    RNNCell(2 => 5, tanh),              # 45 parameters
  ),
  Dense(5 => 1),                        # 6 parameters
)         # Total: 6 trainable arrays, 51 parameters,
          # plus 1 non-trainable, 5 parameters, summarysize 580 bytes.

In [17]:
#try out the model
rnn_model1( x_t1 )

1-element Vector{Float32}:
 -1.3174596

In [18]:
[ rnn_model1( x_t1 ) for _ in 1:4 ]

4-element Vector{Vector{Float32}}:
 [-2.598038]
 [-2.507431]
 [-2.5310478]
 [-2.5289788]

In [22]:
#test the model with hypothetical data
x_length = 5
#generate some random data as inputs, to be treated as a sequence
x_seq = [ rand( Float32 , x_dim ) for i = 1:x_length ] #sequence data
[ rnn_model1( xt ) for xt in x_seq ] #predicted y_t data from the RNN
#this is <sequence to sequence> <many to many>

5-element Vector{Vector{Float32}}:
 [-1.9368374]
 [-0.85135067]
 [0.20908314]
 [-1.7417446]
 [-2.3559318]

[link source](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.codeproject.com%2FArticles%2F3993967%2FApplying-Long-Short-Term-Memory-for-Video-Classifi&psig=AOvVaw3aJjMa8pFFZ1i53Lk2C-gE&ust=1673466988547000&source=images&cd=vfe&ved=0CBAQjhxqFwoTCNjijr7kvfwCFQAAAAAdAAAAABA4)

![rnn](./rnnTypes.jpg)

Let's consider the <u>**sequence to one**</u> (seq-to-one) now

In [31]:

#loss function for 3-to-1, a sequence of 3 inputs for a single output
#this can be used in the training scheme as before
function loss_3_to_1( x , y ) #assumes x is a sequence of 3 time steps
    rnn_model1( x[1] ) # ignores the output but updates the hidden states
    rnn_model1( x[2] ) # ignores the output but updates the hidden states again
    y_hat = rnn_model1( x[3] )
    println( y_hat )
    Flux.mse( y_hat , y )
end

y = rand( Float32 , y_dim ) #target data
x = [ rand(Float32, x_dim ) for i=1:3 ] #sequence of 3 x values
println("y=",y," x=",x)
loss_3_to_1( x , y )

y=Float32[0.6826063] x=Vector{Float32}[[0.4168681, 0.7063472], [0.04967791, 0.73665863], [0.92430115, 0.35278362]]
Float32[-1.3182502]


4.0034266f0
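
The 3-to-1 loss hard-codes the sequence length. A version for sequences of any length could look like the sketch below. It is written in plain Julia so it runs without Flux; `loss_seq_to_one` and `make_toy_model` are hypothetical names, the toy model is a closure with random weights, and the mean squared error is computed by hand.

```julia
# Sketch of a sequence-to-one loss for any sequence length.
# `model` is assumed to be a stateful callable, like the Chain used above.
function loss_seq_to_one(model, x_seq, y)
    for x_t in x_seq[1:end-1]
        model(x_t)                  # output discarded; only the hidden state advances
    end
    y_hat = model(x_seq[end])       # prediction from the final time step
    return sum(abs2, y_hat .- y) / length(y)   # mean squared error
end

# Hypothetical toy model: a closure that keeps its hidden state `h` between calls.
function make_toy_model(x_dim, h_dim, y_dim)
    W_h = 0.5 .* randn(h_dim, h_dim)
    W_x = 0.5 .* randn(h_dim, x_dim)
    W_y = 0.5 .* randn(y_dim, h_dim)
    h = zeros(h_dim)
    function step(x)
        h = tanh.(W_h * h .+ W_x * x)   # update the captured state
        return W_y * h                  # linear read-out, like the Dense layer
    end
    return step
end

toy_model = make_toy_model(2, 5, 1)
loss3 = loss_seq_to_one(toy_model, [rand(2) for _ in 1:3], rand(1))
loss7 = loss_seq_to_one(toy_model, [rand(2) for _ in 1:7], rand(1))
```

The same function handles the 3-step and the 7-step sequence; only the last output contributes to the loss.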

In [36]:
#produce a hypothetical sequence of data points in x_dim dimensions
#with y_dim data and pass that to the loss
x_data = [ [rand(Float32,x_dim) for i=1:3] for j=1:10 ]
y_data = [ rand( Float32 , y_dim ) for j=1:10 ]
data = zip( x_data , y_data ) #pack the data into pairs

for (x_tmp,y_tmp) in data
    println( "loss = " , loss_3_to_1( x_tmp , y_tmp ) )
end

Float32[-2.5044782]
loss = 8.718225
Float32[-1.3792447]
loss = 3.523202
Float32[-2.0760872]
loss = 6.632331
Float32[-2.1436944]
loss = 9.492607
Float32[-1.131919]
loss = 2.2251892
Float32[-0.62875444]
loss = 1.1021379
Float32[0.08114448]
loss = 0.74856144
Float32[-2.4292984]
loss = 9.314036
Float32[-1.6843594]
loss = 3.248972
Float32[-1.7381079]
loss = 6.9522743


<span style="color:orange"> If you need the RNN to be independent of the previous state (e.g. when each sentence is independent), call **Flux.reset!(rnn_model)** before processing the new sequence so that the hidden state carried over from earlier inputs is cleared </span>

In [None]:
#add a reset so that each sentence is taken independently
function loss_3_to_1_reset( x , y ) #assumes x is a sequence of 3 time steps
    Flux.reset!( rnn_model1 ) #reset the model from previous sentences
    rnn_model1( x[1] ) # ignores the output but updates the hidden states
    rnn_model1( x[2] ) # ignores the output but updates the hidden states again
    y_hat = rnn_model1( x[3] )
    Flux.mse( y_hat , y )
end
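
The effect of resetting can be checked on a tiny deterministic example. The sketch below uses a hypothetical scalar recurrence in plain Julia (`ScalarRNN`, `toy_reset!` and `run_seq!` are made-up names, not part of Flux) and shows that the same sequence gives the same result only when the state is reset first.

```julia
# Hypothetical one-dimensional stateful model: h_t = tanh(0.5 * h_{t-1} + x_t)
mutable struct ScalarRNN
    h::Float64
end

# calling the model updates and returns the hidden state
(m::ScalarRNN)(x) = (m.h = tanh(0.5 * m.h + x))

# clear the memory, in the spirit of a reset between independent sequences
toy_reset!(m::ScalarRNN) = (m.h = 0.0; m)

# feed a whole sequence and return the last output
function run_seq!(m, xs; reset = false)
    reset && toy_reset!(m)
    return foldl((_, x) -> m(x), xs; init = nothing)
end

m = ScalarRNN(0.0)
xs = [0.3, -0.1, 0.8]

a = run_seq!(m, xs; reset = true)    # starts from h = 0
b = run_seq!(m, xs; reset = true)    # starts from h = 0 again
c = run_seq!(m, xs; reset = false)   # starts from the state left behind by the previous run
```

Here `a` and `b` are identical because both runs start from a cleared state, while `c` differs because it still remembers the previous sequence.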

In [20]:
### <span style="color:orange">Write a function, then make a function for its gradient</span>