# 16 - Adding Attention Mechanism to Encoder-Decoder

Attention mechanisms let a model focus on different parts of the input sequence when generating each output. This relieves the bottleneck of compressing the whole input into a single fixed-size context vector, and it is the key idea that led to transformers and modern LLMs.

In this notebook, you'll scaffold the steps to add attention to an encoder-decoder model, building up to the core of the transformer architecture.

## 🔎 What is Attention?

Attention computes a weighted sum of encoder hidden states, where the weights are determined by the similarity between the current decoder state and each encoder state.
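
The definition above can be written out as a small worked formula. This is a sketch that assumes dot-product similarity (additive scoring is another common choice); the symbols are introduced here for illustration: $s$ is the current decoder state, $h_i$ is the $i$-th encoder hidden state, and $T$ is the input length.

$$
e_i = s^\top h_i, \qquad \alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{T} \exp(e_j)}, \qquad c = \sum_{i=1}^{T} \alpha_i h_i
$$

The weights $\alpha_i$ are non-negative and sum to 1, so the context vector $c$ is a convex combination of the encoder states.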

**LLM/Transformer Context:**
- Attention lets the decoder look at every input position at every decoding step, which makes long-range dependencies easier to capture and gives each output its own context.

### Task:
- Scaffold a function to compute attention weights given a decoder hidden state and all encoder hidden states.
- Add a docstring explaining its role.

In [None]:
def compute_attention_weights(decoder_hidden, encoder_hiddens):
    """
    Compute attention weights for the decoder hidden state over all encoder hidden states.
    Args:
        decoder_hidden (np.ndarray): Current decoder hidden state (hidden_dim,)
        encoder_hiddens (np.ndarray): All encoder hidden states (seq_len x hidden_dim)
    Returns:
        np.ndarray: Attention weights (seq_len,)
    """
    # TODO: Compute similarity scores and normalize (e.g., softmax)
    pass
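
If you get stuck, the following is one possible reference sketch, not the required solution. It assumes dot-product similarity and that encoder and decoder share the same `hidden_dim`. The name `attention_weights_sketch` is hypothetical and chosen so it does not shadow the function you are asked to write.

```python
import numpy as np

def attention_weights_sketch(decoder_hidden, encoder_hiddens):
    """Dot-product attention weights (illustrative sketch, assumes matching hidden_dim)."""
    # One similarity score per encoder position:
    # (seq_len, hidden_dim) @ (hidden_dim,) -> (seq_len,)
    scores = encoder_hiddens @ decoder_hidden
    # Subtract the max before exponentiating for numerical stability
    exp_scores = np.exp(scores - scores.max())
    # Normalize so the weights sum to 1 (softmax)
    return exp_scores / exp_scores.sum()

# Toy data, made up for illustration
rng = np.random.default_rng(0)
encoder_hiddens = rng.normal(size=(5, 4))   # seq_len=5, hidden_dim=4
decoder_hidden = rng.normal(size=4)

weights = attention_weights_sketch(decoder_hidden, encoder_hiddens)
print(weights.shape, weights.sum())
```

The weights form a probability distribution over the five input positions, so larger scores translate into a larger share of the context.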

## 🔗 Context Vector with Attention

The context vector for the decoder at each step is a weighted sum of encoder hidden states, using the attention weights.

**LLM/Transformer Context:**
- This is the core of the attention mechanism in transformers, allowing dynamic context for each output position.

### Task:
- Scaffold a function to compute the context vector using attention weights and encoder hidden states.
- Add a docstring explaining its role.

In [None]:
def compute_context_vector(attn_weights, encoder_hiddens):
    """
    Compute the context vector as a weighted sum of encoder hidden states.
    Args:
        attn_weights (np.ndarray): Attention weights (seq_len,)
        encoder_hiddens (np.ndarray): Encoder hidden states (seq_len x hidden_dim)
    Returns:
        np.ndarray: Context vector (hidden_dim,)
    """
    # TODO: Compute weighted sum
    pass
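
As a hedged illustration of the weighted sum, the sketch below uses a single vector-matrix product. The function name `context_vector_sketch` and the toy numbers are assumptions made up for this example.

```python
import numpy as np

def context_vector_sketch(attn_weights, encoder_hiddens):
    """Weighted sum of encoder states (illustrative sketch)."""
    # (seq_len,) @ (seq_len, hidden_dim) -> (hidden_dim,)
    return attn_weights @ encoder_hiddens

# Three encoder states of dimension 2, and weights that sum to 1
encoder_hiddens = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
attn_weights = np.array([0.5, 0.25, 0.25])

context = context_vector_sketch(attn_weights, encoder_hiddens)
# 0.5*[1, 0] + 0.25*[0, 1] + 0.25*[1, 1] gives components 0.75 and 0.5
print(context)
```

Because the weights sum to 1, each component of the context vector stays within the range spanned by the corresponding components of the encoder states.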

## 🧮 Decoder with Attention

At each decoding step, the decoder combines the previous output, its previous hidden state, and the context vector from attention to produce a new hidden state and the logits for the next output.

**LLM/Transformer Context:**
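- This decoder-to-encoder attention is the precursor to cross-attention in transformer decoders; multi-head self-attention applies the same weighted-sum idea within a single sequence.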

### Task:
- Scaffold a function for a decoder step that incorporates the context vector from attention.
- Add a docstring explaining its role.

In [None]:
def decoder_step_with_attention(y_prev, h_prev, context_vector, decoder_params):
    """
    Perform one decoder step using the previous output, previous hidden state, and context vector.
    Args:
        y_prev (np.ndarray): Previous output/input to decoder (input_dim,)
        h_prev (np.ndarray): Previous decoder hidden state (hidden_dim,)
        context_vector (np.ndarray): Context vector from attention (hidden_dim,)
        decoder_params (dict): Decoder parameters (weights, biases, etc.)
    Returns:
        tuple: (new_hidden, output_logits)
    """
    # TODO: Implement decoder step with context vector
    pass
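
One way such a step could look is sketched below, assuming a plain tanh RNN cell and an output layer that reads the concatenation of the new hidden state and the context vector. The parameter keys (`W_y`, `W_h`, `W_c`, `b_h`, `W_out`, `b_out`) and the function name are hypothetical; your `decoder_params` from earlier notebooks may be organized differently.

```python
import numpy as np

def decoder_step_sketch(y_prev, h_prev, context_vector, decoder_params):
    """One tanh-RNN decoder step with attention context (illustrative sketch)."""
    p = decoder_params
    # New hidden state mixes the previous output, previous state, and context
    h_new = np.tanh(p["W_y"] @ y_prev + p["W_h"] @ h_prev
                    + p["W_c"] @ context_vector + p["b_h"])
    # Output layer sees both the new state and the context vector
    logits = p["W_out"] @ np.concatenate([h_new, context_vector]) + p["b_out"]
    return h_new, logits

# Toy dimensions and random parameters, made up for illustration
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 6
decoder_params = {
    "W_y": rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
    "W_h": rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
    "W_c": rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
    "b_h": np.zeros(hidden_dim),
    "W_out": rng.normal(scale=0.1, size=(output_dim, 2 * hidden_dim)),
    "b_out": np.zeros(output_dim),
}

h_new, logits = decoder_step_sketch(
    rng.normal(size=input_dim),    # y_prev
    np.zeros(hidden_dim),          # h_prev
    rng.normal(size=hidden_dim),   # context_vector
    decoder_params,
)
print(h_new.shape, logits.shape)
```

Feeding the context vector into both the recurrence and the output layer is a design choice in this sketch; using it in only one of the two places is also a reasonable variant.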

## 🔁 Full Encoder-Decoder with Attention

Combine the encoder, attention mechanism, and decoder to process an input sequence and generate an output sequence.

**LLM/Transformer Context:**
- This structure is the direct precursor to the transformer encoder-decoder architecture used in LLMs like T5 and BART.

### Task:
- Scaffold a function to run the full encoder-decoder with attention for a sequence-to-sequence task.
- Add a docstring explaining the workflow.

In [None]:
def encoder_decoder_with_attention(X_seq, Y_seq, encoder_params, decoder_params):
    """
    Run the full encoder-decoder model with attention for a sequence-to-sequence task.
    Args:
        X_seq (np.ndarray): Input sequence (src_len x input_dim)
        Y_seq (np.ndarray): Target sequence fed to the decoder as input (tgt_len x input_dim);
            tgt_len may differ from src_len
        encoder_params (dict): Encoder parameters
        decoder_params (dict): Decoder parameters
    Returns:
        np.ndarray: Output logits for each output position (tgt_len x output_dim)
    """
    # TODO: Implement the full encoder-decoder with attention
    pass
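
To see how the pieces fit together, here is a compact end-to-end sketch. It rests on several assumptions that are not fixed by this notebook: a plain tanh RNN encoder and decoder, dot-product attention, the decoder starting from the last encoder state, and `Y_seq` supplying the decoder inputs at every step. All function names and parameter keys are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def encode_sketch(X_seq, enc):
    """Plain tanh RNN encoder; returns all hidden states (src_len x hidden_dim)."""
    h = np.zeros(enc["W_h"].shape[0])
    hiddens = []
    for x in X_seq:
        h = np.tanh(enc["W_x"] @ x + enc["W_h"] @ h + enc["b_h"])
        hiddens.append(h)
    return np.stack(hiddens)

def seq2seq_attention_sketch(X_seq, Y_seq, enc, dec):
    """Encoder, attention, and decoder loop (illustrative sketch)."""
    encoder_hiddens = encode_sketch(X_seq, enc)
    h = encoder_hiddens[-1]  # assumption: initialize decoder from last encoder state
    all_logits = []
    for y_prev in Y_seq:
        # Attention weights from the current decoder state
        weights = softmax(encoder_hiddens @ h)
        # Context vector: weighted sum of encoder states
        context = weights @ encoder_hiddens
        # Decoder update using previous output, previous state, and context
        h = np.tanh(dec["W_y"] @ y_prev + dec["W_h"] @ h
                    + dec["W_c"] @ context + dec["b_h"])
        all_logits.append(dec["W_out"] @ np.concatenate([h, context]) + dec["b_out"])
    return np.stack(all_logits)

# Toy dimensions and random parameters, made up for illustration
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 6
src_len, tgt_len = 5, 7

def init(*shape):
    return rng.normal(scale=0.1, size=shape)

enc = {"W_x": init(hidden_dim, input_dim), "W_h": init(hidden_dim, hidden_dim),
       "b_h": np.zeros(hidden_dim)}
dec = {"W_y": init(hidden_dim, input_dim), "W_h": init(hidden_dim, hidden_dim),
       "W_c": init(hidden_dim, hidden_dim), "b_h": np.zeros(hidden_dim),
       "W_out": init(output_dim, 2 * hidden_dim), "b_out": np.zeros(output_dim)}

logits = seq2seq_attention_sketch(
    rng.normal(size=(src_len, input_dim)),
    rng.normal(size=(tgt_len, input_dim)),
    enc, dec,
)
print(logits.shape)  # (7, 6)
```

Note that the output has one row per decoder step, so its length follows `Y_seq` rather than `X_seq`.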

## 🧠 Final Summary: Attention and LLMs

- Attention mechanisms allow models to dynamically focus on relevant parts of the input, relieving the bottleneck of a single fixed-size context vector.
- This idea led directly to the transformer architecture, which is the foundation of nearly all modern LLMs.
- Mastering attention is key to understanding how LLMs process and generate language.

In the next notebook, you'll explore positional encoding, which allows transformers to model order in sequences!