# 13 - GRU from Scratch

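Gated Recurrent Units (GRUs) are a simplified variant of LSTMs that also mitigate the vanishing gradient problem in sequence modeling. A GRU merges the LSTM's cell state and hidden state into a single hidden state and uses two gates instead of three, so it has fewer parameters per layer and is cheaper to compute. GRUs have been used in some language models as a lighter alternative to LSTMs.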

In this notebook, you'll scaffold the core logic of a GRU cell and see how it compares to LSTMs and transformers.

## 🔢 GRU Cell: Gates and State

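A GRU cell uses two gates to control information flow:
- Update gate: decides how much of the previous hidden state to keep and how much to overwrite with the new candidate state.
- Reset gate: decides how much of the previous hidden state is used when computing the candidate state.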
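
Written out as equations, one common formulation of the GRU step looks like the following. The symbols $W$, $U$, and $b$ (input weights, recurrent weights, and biases) are notation chosen for this sketch rather than names fixed by the notebook; $\sigma$ is the logistic sigmoid and $\odot$ is the elementwise product.

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
$$

Conventions vary: some references and libraries swap the roles of $z_t$ and $1 - z_t$ in the last line, or apply the reset gate after the recurrent matrix product. The behavior is equivalent up to how the gate is interpreted.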

**LLM/Transformer Context:**
- GRUs, like LSTMs, were designed to model long-range dependencies in sequences. Transformers pursue the same goal differently: they drop recurrence and let every position attend directly to every other position through self-attention.

### Task:
- Scaffold a function for a single GRU cell step (forward pass).
- Add a docstring explaining the role of each gate and state.

In [None]:
def gru_cell_step(x_t, h_prev, params):
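    """
    Compute one step of a GRU cell.

    The update gate controls how much of h_prev is carried over versus
    replaced by the candidate state; the reset gate controls how much of
    h_prev is used when forming the candidate state.

    Args:
        x_t (np.ndarray): Input at time t (input_dim,)
        h_prev (np.ndarray): Previous hidden state (hidden_dim,)
        params (dict): GRU parameters (weights and biases for all gates)
    Returns:
        np.ndarray: Current hidden state (hidden_dim,)
    """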
    # TODO: Implement the GRU cell computation (gates, candidate state, hidden state)
    pass
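
One possible way to fill in the scaffold above is sketched below. It is a minimal NumPy version, not a reference solution: the function name `gru_cell_step_sketch`, the helper `init_gru_params`, and the `params` keys (`W_z`, `U_z`, `b_z`, and so on) are all assumptions made for illustration, since the notebook does not fix a parameter layout. The update rule uses the convention `h_t = (1 - z) * h_prev + z * h_tilde`.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))


def gru_cell_step_sketch(x_t, h_prev, params):
    """One GRU step. The key names in `params` are assumptions of this sketch."""
    # Update gate: how much of the candidate replaces the old state.
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev + params["b_z"])
    # Reset gate: how much of the old state feeds into the candidate.
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev + params["b_r"])
    # Candidate state, computed from the input and the reset-scaled old state.
    h_tilde = np.tanh(
        params["W_h"] @ x_t + params["U_h"] @ (r * h_prev) + params["b_h"]
    )
    # Interpolate between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_tilde


def init_gru_params(input_dim, hidden_dim, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("z", "r", "h"):
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params[f"U_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params


params = init_gru_params(input_dim=4, hidden_dim=3)
h_next = gru_cell_step_sketch(np.ones(4), np.zeros(3), params)
print(h_next.shape)  # (3,)
```

A quick sanity check on the gating: if the update-gate bias is made strongly negative, `z` is close to 0 and the cell should return something very close to `h_prev`, regardless of the input.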

## 🔗 Unrolling the GRU Over a Sequence

To process a sequence, the GRU cell is applied at each time step, passing the hidden state forward.

**LLM/Transformer Context:**
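- GRUs can model longer dependencies than vanilla RNNs, but the hidden state is still passed step by step, so information from distant tokens must survive many updates. Self-attention gives transformers a direct path between any two positions.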

### Task:
- Scaffold a function to run a GRU over an entire input sequence.
- Add a docstring explaining the sequence processing.

In [None]:
def gru_forward(X_seq, h0, params):
    """
    Run a GRU over an input sequence.
    Args:
        X_seq (np.ndarray): Input sequence (seq_len x input_dim)
        h0 (np.ndarray): Initial hidden state (hidden_dim,)
        params (dict): GRU parameters
    Returns:
        list: Hidden states, one (hidden_dim,) array per time step
    """
    # TODO: Implement GRU unrolling over the sequence
    pass

## 🧮 Output Layer: From Hidden State to Prediction

After processing the sequence, a linear layer maps the GRU's hidden state(s) to output logits, which a softmax turns into predictions (e.g., next-token probabilities).

**LLM/Transformer Context:**
- This is analogous to the output projection in transformers for next-token prediction.

### Task:
- Scaffold a function to compute output logits from GRU hidden states.
- Add a docstring explaining its use in sequence prediction.

In [None]:
def gru_output_layer(h_states, W_hy, b_y):
    """
    Compute output logits from GRU hidden states.
    Args:
        h_states (list or np.ndarray): Hidden states (seq_len x hidden_dim)
        W_hy (np.ndarray): Hidden-to-output weights (output_dim x hidden_dim)
        b_y (np.ndarray): Output bias (output_dim,)
    Returns:
        np.ndarray: Output logits (seq_len x output_dim)
    """
    # TODO: Map hidden states to output logits
    pass
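
A minimal sketch of the projection, assuming the shapes given in the docstring above (`W_hy` is `output_dim x hidden_dim`). The names `gru_output_layer_sketch` and `softmax` are illustrative, and the softmax is included only to show how logits become probabilities.

```python
import numpy as np


def gru_output_layer_sketch(h_states, W_hy, b_y):
    """Project each hidden state to logits with one matrix product."""
    H = np.asarray(h_states)   # (seq_len, hidden_dim)
    return H @ W_hy.T + b_y    # (seq_len, output_dim); b_y broadcasts over rows


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Toy data: 5 time steps, hidden_dim=3, output_dim=7 (illustrative values only).
rng = np.random.default_rng(0)
h_states = [rng.normal(size=3) for _ in range(5)]
W_hy = rng.normal(size=(7, 3))
b_y = np.zeros(7)

logits = gru_output_layer_sketch(h_states, W_hy, b_y)
probs = softmax(logits)
print(logits.shape)  # (5, 7)
```

Each row of `probs` sums to 1 and can be read as a distribution over the output vocabulary at that time step.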

## 🧠 Final Summary: GRUs and LLMs

- GRUs are a simpler alternative to LSTMs for sequence modeling, with fewer gates and fewer parameters.
- Transformers replace recurrence with self-attention, which models longer and more flexible dependencies.
- Understanding GRUs gives you a strong foundation for appreciating the design of LLMs and transformers.
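
The "simpler" point can be made concrete by counting parameters. A GRU layer has three weight blocks (update gate, reset gate, candidate state), while an LSTM layer has four (input, forget, and output gates plus the candidate). The sketch below assumes one bias vector per block; some libraries keep two, which changes the totals slightly but not the 3:4 ratio. The helper name `recurrent_param_count` and the layer sizes are made up for illustration.

```python
def recurrent_param_count(input_dim, hidden_dim, num_blocks):
    """Parameters in a gated recurrent layer with `num_blocks` weight blocks.

    Each block has an input matrix, a recurrent matrix, and one bias vector
    (the single-bias layout is an assumption of this sketch).
    """
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return num_blocks * per_block


gru = recurrent_param_count(input_dim=128, hidden_dim=256, num_blocks=3)
lstm = recurrent_param_count(input_dim=128, hidden_dim=256, num_blocks=4)
print(gru, lstm, gru / lstm)  # 295680 394240 0.75
```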

In the next notebook, you'll compare RNNs, LSTMs, and GRUs side by side!