In [None]:
import numpy as np
from tensorflow.keras.preprocessing import timeseries_dataset_from_array
import sys
sys.path.append("..")
from moisture_rnn import staircase, staircase_2

# Input Data Structure for RNNs

## Background

RNNs are a type of timeseries model that relates the outcome at time $t$ to the outcome at previous times. Like other machine learning models, training is typically done by calculating the gradient of the loss with respect to the weights, or parameters, of the model. With recursive or other types of autoregressive models, the gradient calculation at time $t$ depends on the gradients at $t-1, t-2, ...,$ all the way back to $t=0$. This is computationally expensive, but more importantly it can lead to "vanishing" or "exploding" gradient problems, where many gradients are multiplied together and the product either shrinks toward zero or blows up. See LINK_TO_RECURSIVE_GRADIENT_LATEX for more info...

RNNs and other timeseries neural network architectures* get around this issue by approximating the gradient in more stable ways. In addition to many model architecture and hyperparameter options, timeseries neural networks use two main ways of restructuring the input data.

* **Sequence Length:** The input data is divided into smaller collections of ordered events, known as sequences. When calculating the gradient with respect to the model weights, the calculation only looks back this number of timesteps. Also known as `timesteps`, `sequence_length`, or just "sample" in `tensorflow` functions and related literature. For a sequence length of 3, the data would be broken up as `[1,2,3], [2,3,4], ..., [T-2, T-1, T]`, for a total of `T - timesteps + 1` sequences (here `T - 2`).

* **Batch size:** Sequences are grouped together into batches. The batch size determines how many sequences the network processes in a single step of calculating loss and updating weights. Used as `batch_size` in `tensorflow`.
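The sequence construction described above can be sketched on a toy series. This is a minimal illustration using NumPy's `sliding_window_view`; the helper name `make_sequences` is hypothetical and is not the `staircase` function used later in this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_sequences(series, timesteps):
    """Return all overlapping windows of length `timesteps` from a 1-D series."""
    return sliding_window_view(np.asarray(series), timesteps)


series = np.arange(1, 7)  # times 1..6, so T = 6
sequences = make_sequences(series, timesteps=3)

print(sequences)        # rows are [1 2 3], [2 3 4], [3 4 5], [4 5 6]
print(sequences.shape)  # (4, 3), i.e. T - timesteps + 1 = 4 sequences
```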

The total number of batches is therefore determined by the number of sequences (itself a function of the total number of observations $T$ and the sequence length) and the batch size. In a single batch, the loss is typically calculated for each sequence and then averaged to produce a single value. The gradient of this batch loss with respect to the parameters (weights and biases) is then computed, which is equivalent to averaging the gradients of the individual sequences. So each batch produces a single gradient and a single weight update.
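As a rough numerical check of the batching arithmetic, the sketch below counts sequences and batches and then compares the gradient of an averaged batch loss with the average of per-sequence gradients. The one-parameter model (`prediction = w * x`) and the random data are assumptions made purely for illustration; they are not part of the notebook's model.

```python
import math
import numpy as np

T, timesteps, batch_size = 100, 5, 32
n_sequences = T - timesteps + 1                   # 96 sequences
n_batches = math.ceil(n_sequences / batch_size)   # 3 batches of 32

# Toy one-parameter model (assumption): prediction = w * x, squared-error loss
rng = np.random.default_rng(0)
x = rng.normal(size=batch_size)  # one scalar input per sequence in the batch
y = rng.normal(size=batch_size)  # one scalar target per sequence
w = 0.5


def batch_loss(w):
    """Mean of the per-sequence squared errors."""
    return np.mean((w * x - y) ** 2)


# Analytic gradient of each sequence's loss, then averaged over the batch
per_sequence_grads = 2 * x * (w * x - y)
mean_of_grads = per_sequence_grads.mean()

# Central finite-difference gradient of the averaged batch loss
eps = 1e-6
grad_of_mean_loss = (batch_loss(w + eps) - batch_loss(w - eps)) / (2 * eps)

print(n_sequences, n_batches)
print(np.isclose(mean_of_grads, grad_of_mean_loss))
```

With a mean-reduced loss the two quantities agree; a sum-reduced loss would differ only by a factor of `batch_size`.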

The function `timeseries_dataset_from_array` from `keras.preprocessing` can be used for this purpose, and we will compare that to a custom function `staircase` that demonstrates the method in detail.

**\*Note:** these same data principles apply to more complex versions of timeseries neural network layers, such as LSTM and GRU.

## Stateless vs Stateful Networks

RNNs have a hidden state that represents the recurrent layer output at a previous time. There is a weight and bias at each RNN cell that determines the relative contribution of the previous output to the current output. When updating weights in RNNs, there are two main types of training schemes:

**Stateless:** the hidden state is reset to the initial state (often zero) at the start of each new sequence in a batch. So, the network treats each sequence independently, and no information is carried over in time between sequences. These models are simpler, and work best when time dependence is relatively short.
* **Input Data Shape:** (`n_sequence`, `timesteps`, `features`), where `n_sequence` is the total number of sequences (a function of the total observed times `T` and the user's choice of `timesteps`). The input data does NOT need to be structured with batch size in stateless RNNs.
* **Tensorflow RNN Args:** for a stateless RNN, use the `input_shape` parameter, with `input_shape`=(`timesteps`, `features`). Then, `batch_size` can be declared in the fitting stage with `model.fit(X_train, batch_size = __)`. 

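**Stateful:** the hidden states are carried over from one batch to the next: the final state of the sequence at position $i$ in one batch becomes the initial state of the sequence at position $i$ in the following batch. Longer time dependencies can be learned in this way.
* **Input Data Shape:** (`batch_size`, `timesteps`, `features`). In order for the hidden state to be passed between batches, the input data must be formatted using the `batch_size` hyperparameter, and every batch must contain the same number of sequences.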
* **Tensorflow RNN Args:** for a stateful RNN, use the `batch_input_shape` parameter, with `batch_input_shape`=(`batch_size`, `timesteps`, `features`)
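The difference between resetting and carrying the hidden state can be sketched with a hand-written tanh RNN cell in plain NumPy. This is not Keras's implementation: the weights, the layer size, and the helper `run_sequence` are all hypothetical and chosen only to make the state handling visible.

```python
import numpy as np

rng = np.random.default_rng(0)
features, units = 1, 4
W_x = rng.normal(size=(features, units))  # input-to-hidden weights (random, untrained)
W_h = rng.normal(size=(units, units))     # hidden-to-hidden weights (random, untrained)
b = np.zeros(units)


def run_sequence(seq, h0):
    """Run a simple tanh RNN over seq of shape (timesteps, features); return the final state."""
    h = h0
    for x_t in seq:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h


seq_a = np.array([[0.1], [0.2], [0.3]])  # first sequence
seq_b = np.array([[0.4], [0.5], [0.6]])  # the sequence that follows it in time
zero_state = np.zeros(units)

# Stateless: every sequence starts from the initial (zero) state
h_b_stateless = run_sequence(seq_b, zero_state)

# Stateful: the final state of seq_a is the initial state of seq_b
h_a = run_sequence(seq_a, zero_state)
h_b_stateful = run_sequence(seq_b, h_a)

# Carrying the state is equivalent to processing one long sequence
h_full = run_sequence(np.concatenate([seq_a, seq_b]), zero_state)
print(np.allclose(h_b_stateful, h_full))
print(np.allclose(h_b_stateless, h_full))
```

The stateful result matches the state obtained from the concatenated sequence, while the stateless result has no memory of `seq_a`.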

## Examples

### Data Description

Consider $T=100$ observations of a variable in time $y_t$, so $t=1, ..., 100$. A feature matrix with $3$ features has dimensions $100\times 3$, and must be restructured for use in RNNs. In the code below, the data is simplified to a single feature column holding the time indices 1, 2, ..., 100 (generated with `np.arange(1, 101)`, since the upper bound is exclusive); a three-feature version is left commented out.

In [None]:
T = 100  # total number of observed times
features = 3

# data = np.column_stack(
#     (np.arange(1, 101)+10, 
#      np.arange(1, 101)+20, 
#      np.arange(1, 101)+30))
data = np.arange(1, 101).reshape(-1, 1)

# Response vector (here the same values as the feature column), needed by staircase func
y = np.arange(1, 101).reshape(-1, 1)

print(f"Response Data Shape: {y.shape}")
print("First 10 rows")
print(y[0:10])

# Print head of data
print(f"Feature Data Shape: {data.shape}")
print("First 10 rows")
data[0:10,:]

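The rows of the input data array represent all features at a single timepoint. In the commented-out three-feature version, the tens digit is intended to indicate the feature number and the ones digit the time point, so the value $13$ represents feature 1 at time 3 (this reading only holds for times 1 through 9). In the single-column array actually used here, each value is simply the time index.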

### Stateless Example

With a stateless RNN, the input data is structured to be of shape (`n_sequence`, `timesteps`, `features`). The `batch_size` is not needed to structure the data for a stateless RNN.

When using functions that expect `batch_size` to structure the data, one option is to set `batch_size` to a number larger than the total number of observed times $T$, so that all the data is guaranteed to be in one batch. *NOTE:* here we trick the function by using a large batch size, but `batch_size` can still be declared at the fitting stage of the model.

Suppose we use `timesteps=5`. We would get the sequence of times `[1,2,3,4,5]` first, then `[2,3,4,5,6]`, and so on until `[96,97,98,99,100]`.

Thus, there are `100-5+1=96` possible ordered sequences of length `5`. 

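We need to structure the input data to the RNN to be of shape (96, 5, 3) for the three-feature matrix; with the single-column `data` array defined above, the last dimension is 1 instead of 3. *Note:* since the model is stateless, and the sequences are treated independently, the actual order of the sequences doesn't matter.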
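The target shape can be checked with a short NumPy sketch. The three-feature matrix `X` below is a made-up stand-in (each column is the time index plus an offset), and `sliding_window_view` is used here instead of the notebook's `staircase` function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

T, timesteps, features = 100, 5, 3

# Hypothetical feature matrix: column k holds the times 1..100 offset by 100 * k
X = np.column_stack([np.arange(1, T + 1) + 100 * k for k in range(features)])

# Windows along the time axis come back as (n_sequence, features, timesteps)
windows = sliding_window_view(X, window_shape=timesteps, axis=0)

# Reorder to the (n_sequence, timesteps, features) layout expected by an RNN
X_seq = np.transpose(windows, (0, 2, 1))

print(X_seq.shape)      # (96, 5, 3)
print(X_seq[0, :, 0])   # first sequence, first feature: times 1..5
print(X_seq[-1, :, 0])  # last sequence, first feature: times 96..100
```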

For a stateless RNN, the batches could consist of any collection of the sequences, since the sequences are independent.

We want all of the data sequences to be in a single batch, but the number of batches is not a direct user input for most built-in functions. To get around this, we make the batch size some number larger than the total number of observed times.

We now recreate this structure using the custom `staircase` function, which should produce the same input sequences as the built-in approach:

In [None]:
X_train, y_train = staircase(data, y, timesteps = 5, datapoints = len(y), verbose=True)

In [None]:
print(X_train.shape)
print(y_train.shape)

In [None]:
print(X_train)

### Stateful Example

We now need the data in the format (`batch_size`, `timesteps`, `features`) for each batch. 
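One common way to lay out data so that state can be carried across batches is sketched below. In Keras, a stateful layer reuses the final state of sample $i$ in one batch as the initial state of sample $i$ in the next batch, so sample $i$ of consecutive batches should be consecutive in time. The helper `stateful_batches` is hypothetical, uses non-overlapping windows, and is not necessarily what `staircase_2` does; a small `batch_size` of 4 is assumed so that $T=100$ divides evenly.

```python
import numpy as np


def stateful_batches(X, timesteps, batch_size):
    """Arrange X of shape (T, features) as (n_batches, batch_size, timesteps, features).

    The series is cut into `batch_size` contiguous streams; batch k holds the
    k-th window of every stream, so sample i continues in time from batch to batch.
    """
    T, features = X.shape
    stream_len = (T // (batch_size * timesteps)) * timesteps  # usable length per stream
    n_batches = stream_len // timesteps
    usable = X[: batch_size * stream_len]                     # drop any leftover rows
    windows = usable.reshape(batch_size, n_batches, timesteps, features)
    return windows.transpose(1, 0, 2, 3)


X = np.arange(1, 101).reshape(-1, 1)  # same single-feature layout as `data`
batches = stateful_batches(X, timesteps=5, batch_size=4)

print(batches.shape)         # (5, 4, 5, 1)
print(batches[0, 0, :, 0])   # sample 0 of batch 0: times 1..5
print(batches[1, 0, :, 0])   # sample 0 of batch 1: times 6..10
print(batches[0, 1, :, 0])   # sample 1 of batch 0: times 26..30
```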

In [None]:
X_train, y_train = staircase_2(data, y, timesteps = 5, batch_size = 32, verbose=False)

In [None]:
X_train

## References

https://d2l.ai/chapter_recurrent-neural-networks/bptt.html

https://www.tensorflow.org/guide/keras/working_with_rnns#cross-batch_statefulness

Tensorflow `timeseries_dataset_from_array` tutorial: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/timeseries_dataset_from_array

Wiki BPTT: https://en.wikipedia.org/wiki/Backpropagation_through_time

https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/