# 12 - LSTM from Scratch

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) designed to capture long-range dependencies in sequences. They were a major advance in sequence modeling before transformers, and understanding them helps you appreciate the problems that transformers solve in LLMs.

In this notebook, you'll scaffold the core logic of an LSTM cell and see how it improves on basic RNNs.

## 🔢 LSTM Cell: Gates and State

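An LSTM cell maintains a cell state across time steps and uses three gates to control what flows into and out of it:
- Forget gate: decides how much of the previous cell state to keep
- Input gate: decides how much new candidate information to write
- Output gate: decides how much of the cell state to expose as the hidden state
- Cell state update: combines the retained old state with the gated candidate values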
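
For reference, one common way to write these components as equations is sketched below. The symbols $W_f, W_i, W_o, W_c$ and $b_f, b_i, b_o, b_c$ (per-gate weights and biases) and the concatenation $[h_{t-1}; x_t]$ are notational assumptions introduced here, not names fixed by this notebook.

$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}; x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}; x_t] + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o [h_{t-1}; x_t] + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}; x_t] + b_c\right) && \text{(candidate values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$

Here $\sigma$ is the logistic sigmoid and $\odot$ is elementwise multiplication. Because the gates lie in $(0, 1)$, each one acts as a soft switch on the vector it multiplies.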

**LLM/Transformer Context:**
- LSTMs were the state of the art for many sequence modeling tasks before transformers. Their gated, additive cell-state update mitigates the vanishing gradient problem, so they can model longer dependencies than vanilla RNNs.
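
To make the vanishing gradient point concrete, here is a rough scalar illustration. In a vanilla RNN, the gradient flowing back through each step is scaled by roughly the recurrent weight times the tanh derivative. Along the LSTM's cell-state path it is scaled by the forget gate (ignoring the gates' own dependence on earlier states). The numbers below are hypothetical, chosen only to show the effect of repeated multiplication.

```python
T = 50  # number of time steps to backpropagate through

# Hypothetical per-step factors (illustrative, not measured from a trained model).
rnn_step_factor = 0.9 * 0.6   # recurrent weight * tanh derivative
lstm_step_factor = 0.98       # forget gate value close to 1

# Repeated multiplication over T steps.
rnn_grad_scale = rnn_step_factor ** T
lstm_grad_scale = lstm_step_factor ** T

print(f"vanilla RNN path:     {rnn_grad_scale:.2e}")
print(f"LSTM cell-state path: {lstm_grad_scale:.2e}")
```

With these assumed values, the vanilla RNN factor shrinks by many orders of magnitude while the cell-state factor stays at a usable scale, which is the intuition behind the forget gate keeping long-range signal alive.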

### Task:
- Scaffold a function for a single LSTM cell step (forward pass).
- Add a docstring explaining the role of each gate and state.

In [None]:
def lstm_cell_step(x_t, h_prev, c_prev, params):
    """
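    Compute one forward step of an LSTM cell.

    The forget gate decides how much of c_prev to keep, the input gate
    decides how much of the new candidate values to write, and the output
    gate decides how much of the updated cell state is exposed as h_t.
    The cell state c_t carries long-term memory; the hidden state h_t is
    the cell's per-step output.

    Args:
        x_t (np.ndarray): Input at time t (input_dim,)
        h_prev (np.ndarray): Previous hidden state (hidden_dim,)
        c_prev (np.ndarray): Previous cell state (hidden_dim,)
        params (dict): LSTM parameters (weights and biases for all gates)
    Returns:
        tuple: (h_t, c_t) - current hidden and cell state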
    """
    # TODO: Implement the LSTM cell computation (gates, cell state, hidden state)
    pass
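
One possible way to fill in this step is sketched below. It is a minimal NumPy version, not a reference solution. The parameter layout is an assumption: `params` is taken to hold one weight matrix per gate (`"W_f"`, `"W_i"`, `"W_o"`, `"W_c"`) of shape `(hidden_dim, hidden_dim + input_dim)` acting on the concatenation of `h_prev` and `x_t`, plus matching biases. The names `lstm_cell_step_sketch` and `init_params_sketch` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, used for the three gates."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step_sketch(x_t, h_prev, c_prev, params):
    # Assumed layout: every gate sees the concatenated vector [h_prev; x_t].
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                    # cell state update
    h_t = o_t * np.tanh(c_t)                              # hidden state
    return h_t, c_t

def init_params_sketch(input_dim, hidden_dim, seed=0):
    # Small random weights and zero biases; an arbitrary choice for the demo.
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "o", "c"):
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim + input_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params

params = init_params_sketch(input_dim=3, hidden_dim=4)
h_t, c_t = lstm_cell_step_sketch(np.ones(3), np.zeros(4), np.zeros(4), params)
print(h_t.shape, c_t.shape)  # (4,) (4,)
```

Concatenating `h_prev` and `x_t` keeps one matrix per gate; splitting each into separate input and recurrent matrices is an equivalent design.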

## 🔗 Unrolling the LSTM Over a Sequence

To process a sequence, the LSTM cell is applied at each time step, passing both the hidden and cell state forward.

**LLM/Transformer Context:**
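- LSTMs can model longer dependencies than vanilla RNNs, but information must still pass through every intermediate time step. Transformers go further with self-attention, which connects any two positions in the sequence directly.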

### Task:
- Scaffold a function to run an LSTM over an entire input sequence.
- Add a docstring explaining the sequence processing.

In [None]:
def lstm_forward(X_seq, h0, c0, params):
    """
    Run an LSTM over an input sequence.

    The cell is applied once per time step, in order; the (h_t, c_t) pair
    produced at step t is fed back in as (h_prev, c_prev) at step t + 1.

    Args:
        X_seq (np.ndarray): Input sequence (seq_len, input_dim)
        h0 (np.ndarray): Initial hidden state (hidden_dim,)
        c0 (np.ndarray): Initial cell state (hidden_dim,)
        params (dict): LSTM parameters
    Returns:
        list: List of (h_t, c_t) tuples for each time step
    """
    # TODO: Implement LSTM unrolling over the sequence
    pass

## 🧮 Output Layer: From Hidden State to Prediction

After processing the sequence, the LSTM's hidden state(s) are mapped to output predictions (e.g., next token probabilities).

**LLM/Transformer Context:**
- This is analogous to the output projection in transformers for next-token prediction.

### Task:
- Scaffold a function to compute output logits from LSTM hidden states.
- Add a docstring explaining its use in sequence prediction.

In [None]:
def lstm_output_layer(h_states, W_hy, b_y):
    """
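    Compute output logits from LSTM hidden states.

    Each hidden state is projected linearly to output_dim scores. For
    next-token prediction, a softmax over each row of the result gives a
    probability distribution over the vocabulary.

    Args:
        h_states (list or np.ndarray): Hidden states (seq_len, hidden_dim)
        W_hy (np.ndarray): Hidden-to-output weights (output_dim, hidden_dim)
        b_y (np.ndarray): Output bias (output_dim,)
    Returns:
        np.ndarray: Output logits (seq_len, output_dim)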
    """
    # TODO: Map hidden states to output logits
    pass
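
A minimal sketch of this projection is shown below, using the shapes given in the docstring. The `softmax` helper and the name `lstm_output_layer_sketch` are additions for illustration; the random hidden states stand in for real LSTM outputs.

```python
import numpy as np

def lstm_output_layer_sketch(h_states, W_hy, b_y):
    H = np.asarray(h_states)   # (seq_len, hidden_dim); also accepts a list of vectors
    return H @ W_hy.T + b_y    # (seq_len, output_dim); bias broadcasts over time steps

def softmax(logits):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Demo with arbitrary small dimensions and placeholder hidden states.
rng = np.random.default_rng(0)
seq_len, hidden_dim, output_dim = 5, 4, 6
h_states = [rng.normal(size=hidden_dim) for _ in range(seq_len)]
W_hy = rng.normal(0.0, 0.1, size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

logits = lstm_output_layer_sketch(h_states, W_hy, b_y)
probs = softmax(logits)        # each row sums to 1
print(logits.shape)  # (5, 6)
```

Computing all time steps with one matrix product avoids a Python loop; the result matches applying `W_hy @ h + b_y` to each hidden state separately.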

## 🧠 Final Summary: LSTMs and LLMs

- LSTMs were a major advance in sequence modeling, enabling learning of long-range dependencies.
- Transformers tackle the same problem differently: they replace recurrence with self-attention, which models longer and more flexible dependencies.
- Understanding LSTMs gives you a strong foundation for appreciating the design of LLMs and transformers.

In the next notebook, you'll explore GRUs, another important RNN variant!