# Deep Recurrent Neural Networks (RNNs): Stacked Architectures

## I. Introduction and Motivation

### A. Defining Deep RNNs
*   Deep RNNs are also referred to as **Stacked RNNs**.
*   The basic concept involves **stacking multiple layers of RNNs** vertically to solve more complex types of problems.
*   The same concept used to create complex, multi-layered Artificial Neural Networks (ANNs) by adding more hidden layers can be applied to RNNs.

### B. Analogy: From Shallow to Deep ANNs
The concept of stacking layers to increase complexity originated in standard ANNs:
1.  **Simple ANNs:** If a neural network (NN) performs poorly (e.g., struggling to map a complex dataset like the spiral dataset), the initial response might be to increase the **number of neurons (units)** in the hidden layer.
2.  **Increased Power:** Increasing the number of neurons improves the network's **representation power** and its ability to effectively find patterns in the data.
3.  **Deeper ANNs:** If adding neurons is insufficient, you can add **more hidden layers**. Additional layers and neurons increase the network's capacity, which can improve results when enough training data is available.
4.  **Deep RNNs:** This concept of making a network more complex by adding layers is directly applied to RNNs to create Deep RNNs.

## II. Deep RNN Architecture and Information Flow

### A. Structure: Single Layer vs. Stacked Layers
*   A **standard (single-layer) RNN** cell receives the current input ($X_t$) and the previous time step's hidden state ($H_{t-1}$), combines them through a weighted sum followed by a $\tanh$ activation, and outputs the current hidden state ($H_t$).
*   In a Deep RNN, multiple RNN cells (hidden layers) are **stacked vertically**.
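*   There is no architectural limit on how many hidden layers can be stacked vertically to increase the representation power of the RNN; in practice, depth is bounded by the available data, the compute budget, and how well the network trains.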

### B. Input Flow in Stacked Layers
In a Deep RNN structure (e.g., with two hidden layers, Layer 1 and Layer 2):
1.  **Layer 1 (The first hidden layer):** Receives the input directly from the **Input Layer** (the current word vector, $X_t$). It also receives input from its own previous time step.
2.  **Layer 2 (The second hidden layer):** Receives its vertical input from the **output of Hidden Layer 1**. It also receives input from its own previous time step.
3.  **Subsequent Layers:** A potential third hidden layer would receive its input from the output of Hidden Layer 2.

### C. Details of Internal Connections
*   When calculating the output for a given time step $t$ in a layer $L$, the calculation involves inputs flowing both across time (horizontal) and across depth (vertical).
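*   The defining connection in an RNN is the **feedback loop** (the recurrent state), which links a layer's hidden state at one time step to the same layer at the next time step. In a simple RNN layer with 3 units, this recurrent connection is a $3 \times 3$ weight matrix, regardless of the input dimension.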
*   The **number of units** (nodes) in different stacked RNN layers can be the same or different. For example, Layer 1 might have 3 units and Layer 2 might have 2 units.
    *   The connection between the Input (3 dimensions) and Hidden Layer 1 (3 units) is $3 \times 3$.
    *   The connection between Hidden Layer 1 (3 units) and Hidden Layer 2 (2 units) is $3 \times 2$.
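
The connection sizes above can be turned into a parameter count. The following is a minimal sketch, assuming each simple RNN layer has one input weight matrix, one recurrent weight matrix, and one bias per unit; the helper name `simple_rnn_params` is hypothetical and used only for illustration.

```python
def simple_rnn_params(input_dim, units):
    """Count trainable parameters in one simple RNN layer (illustrative helper)."""
    input_weights = input_dim * units    # connection from the layer below
    recurrent_weights = units * units    # feedback loop across time
    biases = units                       # one bias per unit
    return input_weights + recurrent_weights + biases

# Layer 1: 3-dimensional input, 3 units -> 9 + 9 + 3
layer1 = simple_rnn_params(input_dim=3, units=3)
# Layer 2: fed by Layer 1 (3 units), 2 units -> 6 + 4 + 2
layer2 = simple_rnn_params(input_dim=3, units=2)

print(layer1, layer2, layer1 + layer2)  # 21 12 33
```

Note that the input dimension of Layer 2 is the number of units in Layer 1, which is why the two layers can have different sizes.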

## III. Notation and General Equation

### A. Coordinate System
To identify components within a Deep RNN, which forms a 2D grid structure, two axes are used:
1.  **Time Axis ($t$):** Represents the sequence/time steps.
2.  **Depth Axis ($L$):** Represents the hidden layers (how deep the RNN is).

### B. Cell Identification (Hidden State)
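*   Any particular RNN cell is identified using the notation: **$H_t^L$**.
    *   $H$: Hidden state of the cell.
    *   $t$: Current time step (e.g., 3), shown as a subscript.
    *   $L$: Current layer (e.g., 2), shown as a superscript.
    *   Example: $H_3^2$ is the hidden state of Layer 2 at time step 3.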

### C. Inputs to a Cell ($H_t^L$)
A cell $H_t^L$ receives two main inputs:
1.  **Input from the Previous Layer (Vertical Flow):** This comes from the same time step ($t$) but the previous layer ($L-1$).
    *   Notation: $H_t^{L-1}$.
2.  **Input from the Previous Time Step (Horizontal Flow):** This comes from the previous time step ($t-1$) but the same layer ($L$).
    *   Notation: $H_{t-1}^L$.

### D. The General Deep RNN Equation
The calculation for the output of an RNN cell $H_t^L$ relies on the $\tanh$ activation function and two sets of weights ($W$ and $U$):

$$H_t^L = \tanh(W^L \cdot H_{t-1}^L + U^L \cdot H_t^{L-1} + B^L)$$

*   $W^L$: Weight matrix for the recurrent connection from the previous time step ($H_{t-1}^L$) in Layer $L$.
*   $U^L$: Weight matrix for the input from the previous layer ($H_t^{L-1}$) in Layer $L$.
*   $B^L$: Bias vector for Layer $L$.
*   For the first hidden layer ($L = 1$), the "previous layer" is the input layer, so $H_t^0 = X_t$.
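
The equation can be traced in code. Below is a minimal NumPy sketch of a forward pass through a stacked simple RNN, not a reference implementation. It assumes a zero initial hidden state, treats the input $X_t$ as $H_t^0$, and stores $U^L$ with shape `(units, prev_dim)` so that `U @ h` works on vectors; the names `deep_rnn_forward` and `init_layer` are hypothetical.

```python
import numpy as np

def deep_rnn_forward(X, params):
    """Run a stacked simple RNN over one sequence.

    X:      array of shape (T, input_dim), one row per time step.
    params: list of (W, U, b) tuples, one per layer, where
            W has shape (units, units)     (recurrent weights W^L),
            U has shape (units, prev_dim)  (weights on the layer below, U^L),
            b has shape (units,)           (bias B^L).
    Returns a list with one (T, units) array of hidden states per layer.
    """
    layer_input = X                      # H^0_t is taken to be the input X_t
    all_states = []
    for W, U, b in params:               # move up the depth axis
        h = np.zeros(W.shape[0])         # assumed zero initial state H^L_0
        states = []
        for below_t in layer_input:      # move along the time axis
            h = np.tanh(W @ h + U @ below_t + b)
            states.append(h)
        layer_input = np.stack(states)   # this layer's sequence feeds the next layer
        all_states.append(layer_input)
    return all_states

rng = np.random.default_rng(0)

def init_layer(prev_dim, units):
    """Small random weights and zero bias for one layer (illustrative only)."""
    W = rng.normal(scale=0.5, size=(units, units))
    U = rng.normal(scale=0.5, size=(units, prev_dim))
    return W, U, np.zeros(units)

X = rng.normal(size=(4, 3))                     # 4 time steps, 3-dimensional input
params = [init_layer(3, 3), init_layer(3, 2)]   # Layer 1: 3 units, Layer 2: 2 units
H1, H2 = deep_rnn_forward(X, params)
print(H1.shape, H2.shape)  # (4, 3) (4, 2)
```

Each layer returns its full sequence of hidden states because the layer above needs $H_t^{L-1}$ at every time step, not only the last one.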

## IV. Why and When to Use Deep RNNs

### A. Primary Motivation: Handling Data Complexity
The biggest motivation for using Deep RNNs is to handle the **complexity of the data**.

1.  **Hierarchical Representation:** Deep RNNs can capture the inherent hierarchy present in data, such as natural language.
    *   **Initial Layers (Shallow):** Find very basic or **primitive features** (e.g., identifying individual words like "love," "hate," or "terrible"). They work at the **word-by-word level**.
    *   **Middle Layers:** Focus on processing information at the **sentence or phrase level** (e.g., understanding "Audio is bad").
    *   **Higher Layers (Deep):** Use the accumulated information from previous layers to determine the **overall sentiment** or context of the entire input (e.g., integrating "Audio is bad" with "Display is great" to conclude "Overall I am happy").

2.  **Customization for Advanced Tasks:** Deep RNNs allow for the construction of more advanced architectures.
    *   They can be used to build components like the **Encoder-Decoder module** often utilized in machine translation, potentially incorporating an **attention mechanism**. Google Translate, for instance, has historically used Deep RNNs: its neural machine translation system (GNMT) was built on stacked LSTM layers.

### B. Situational Guidelines (When to use Deep RNNs)
Deep RNNs are preferred over single-layer RNNs in several scenarios:
1.  **Complex Problem Statements:** Such as speech recognition or machine translation.
2.  **Large Datasets:** Deep RNNs require **sufficient data**. Using them with too little data increases the risk of **overfitting**.
3.  **Sufficient Computational Resources:** Due to their complexity, training Deep RNNs requires enough processing power and computational resources.
4.  **Unsatisfactory Simpler Models:** If a single-layer RNN (the baseline model) does not provide satisfactory results, a Deep RNN can be attempted as an improvement.

## V. Extension to LSTMs and GRUs

### A. Applicability to Gated Architectures
The concept of stacking is not limited to simple RNNs; it is applicable to other RNN architectures:
*   **Deep LSTMs:** Stacking one Long Short-Term Memory (LSTM) layer on top of another.
*   **Deep GRUs:** Stacking Gated Recurrent Units (GRUs).

### B. Practical Usage
*   In practical deep learning, the Deep RNN concept is **most frequently applied** to LSTMs (Deep LSTMs) or GRUs (Deep GRUs).
*   Deep RNNs built from simple (vanilla) RNN layers are **generally not used** because they still suffer from the vanishing and exploding gradient problems inherent in simple RNNs.

### C. Implementation Detail (Keras)
When implementing a stacked architecture (Deep RNN, Deep LSTM, or Deep GRU), the crucial parameter in Keras recurrent layers is `return_sequences`:
*   For the intermediate hidden layers (all layers *except* the last hidden layer), `return_sequences` **must be set to True** so that the layer outputs its hidden state at every time step.
*   If `return_sequences=False` (the default) is left on an intermediate layer, that layer outputs only its final hidden state, a 2D tensor of shape `(batch, units)`. The next recurrent layer expects a 3D sequence of shape `(batch, timesteps, features)`, so the vertical connection (information flow) is broken and Keras raises an input-shape error when the model is built.
*   The **last hidden layer** typically keeps `return_sequences=False` if the output is only needed from the final time step.
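
A minimal sketch of a two-layer stacked LSTM in Keras, assuming TensorFlow's bundled Keras is installed. The layer sizes, the input shape (10 time steps, 8 features), and the sigmoid output head are arbitrary illustrative choices, not values from these notes.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(10, 8)),                    # (timesteps, features), illustrative
    keras.layers.LSTM(32, return_sequences=True),  # intermediate layer: emits all 10 hidden states
    keras.layers.LSTM(16),                         # last recurrent layer: default return_sequences=False
    keras.layers.Dense(1, activation="sigmoid"),   # e.g. a binary sentiment output
])

model.summary()
```

Swapping `LSTM` for `SimpleRNN` or `GRU` leaves the rule unchanged: every recurrent layer that feeds another recurrent layer needs `return_sequences=True`.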

## VI. Disadvantages

Using Deep RNNs introduces complexities and drawbacks:
1.  **Increased Risk of Overfitting:** The added model complexity makes it easier to overfit (high performance on training data but low performance on test data). Countering this requires careful network design and techniques such as **dropout** and **regularization**; stable training additionally depends on a well-chosen learning rate and weight initialization.
2.  **Increased Training Time:** Deep RNNs have a **higher number of parameters**, so backpropagation must compute gradients for, and update, many more weights. This takes more time, especially with large datasets.
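
As one possible way to apply the dropout and regularization mentioned above, the sketch below adds them to a stacked LSTM in Keras. It assumes TensorFlow's bundled Keras is installed; the rates (0.2), the L2 factor (1e-4), the learning rate, and the layer sizes are placeholder values that would need tuning for a real task.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(10, 8)),                          # (timesteps, features), illustrative
    keras.layers.LSTM(
        32,
        return_sequences=True,                           # required: feeds another recurrent layer
        dropout=0.2,                                     # dropout on the layer's inputs
        recurrent_dropout=0.2,                           # dropout on the recurrent state
        kernel_regularizer=keras.regularizers.l2(1e-4),  # L2 penalty on the input weights
    ),
    keras.layers.LSTM(16, dropout=0.2),
    keras.layers.Dense(1, activation="sigmoid"),
])

# An explicit learning rate makes this hyperparameter visible and easy to tune.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

Dropout is only active during training, so evaluation and prediction use the full network.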