# Transformer Architecture Questions

The answers below draw on the notebook's content and code.

## 1. Why do we implement PositionalEncoding in part 1?

Positional encoding is needed because self-attention has no built-in notion of token order: it processes all tokens in parallel rather than sequentially, so without extra position information a sequence looks like an unordered set of tokens. As explained in the notebook:

> We have Positional Encoding, which we require for the Transformer model to understand and relate the relative position of input and output tokens or embeddings.

The notebook implements positional encoding using sine and cosine functions of different frequencies:



$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$



This allows the model to inject information about token positions without losing the benefits of parallelization.
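
As a rough illustration of the formula, here is a minimal PyTorch sketch. The function name `sinusoidal_positional_encoding` and the sizes used are assumptions made for this example, not the notebook's exact code.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sine/cosine position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
embeddings = torch.zeros(2, 50, 16)      # placeholder batch of token embeddings
encoded = embeddings + pe.unsqueeze(0)   # the table broadcasts over the batch dimension
```

Because the table depends only on the position and the dimension index, it can be computed once and added to every batch.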

## 2. How do we feed Query, Key, and Value to the multi-head attention blocks?

In the notebook's `MultiHeadAttention` class implementation, Query, Key, and Value are processed as follows:

1. Input tensors are linearly projected to create query, key, and value representations:
   ```python
   query = self.split_heads(self.query_linear(query), batch_size)
   key = self.split_heads(self.key_linear(key), batch_size)
   value = self.split_heads(self.value_linear(value), batch_size) 
   ```

2. The `split_heads` call in the snippet above reshapes each projection so that every attention head receives its own `head_dim`-sized slice.

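3. Attention weights are computed from the query and key, with the mask applied (in the standard formulation this is a scaled dot product followed by a softmax):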
   ```python
   attention_weights = self.compute_attention(query, key, mask)
   ```

4. The attention weights are applied to the values:
   ```python
   output = torch.matmul(attention_weights, value)
   ```

5. Results from all heads are concatenated and projected to the output dimension
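
The five steps above can be sketched end to end as follows. This is a minimal sketch, not the notebook's implementation: the class name `TinyMultiHeadAttention`, the `output_linear` layer, and the mask convention (boolean, `True` means "may attend") are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class TinyMultiHeadAttention(nn.Module):
    """Hypothetical, simplified multi-head attention for illustration only."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)  # assumed name

    def split_heads(self, x, batch_size):
        # (batch, seq, d_model) -> (batch * heads, seq, head_dim)
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3).reshape(batch_size * self.num_heads, -1, self.head_dim)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Steps 1 and 2: project, then split across heads
        q = self.split_heads(self.query_linear(query), batch_size)
        k = self.split_heads(self.key_linear(key), batch_size)
        v = self.split_heads(self.value_linear(value), batch_size)
        # Step 3: scaled dot-product scores, masked, then softmax
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            # mask is boolean and broadcastable to the scores; False = blocked
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        # Step 4: weighted sum of the values
        out = torch.matmul(weights, v)
        # Step 5: concatenate the heads and project
        out = out.reshape(batch_size, self.num_heads, -1, self.head_dim)
        out = out.permute(0, 2, 1, 3).reshape(batch_size, -1, self.num_heads * self.head_dim)
        return self.output_linear(out)

mha = TinyMultiHeadAttention(d_model=32, num_heads=4)
x = torch.randn(2, 5, 32)
causal = torch.tril(torch.ones(1, 5, 5)).bool()
y = mha(x, x, x, causal)   # same shape as x: (2, 5, 32)
```

The output has the same shape as the query input, which is what lets attention blocks be stacked with residual connections.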

## 3. What is the purpose of the mask (self_attention_mask) defined in Part 9?

The `self_attention_mask` in Part 9 is a causal mask that creates a triangular attention pattern. Its purpose is to prevent the decoder from "cheating" by looking at future tokens when making predictions. The notebook explains:

> The triangular mask defined here is the causal mask that prohibits the decoder from observing the "future" or cheating. For the first element in the sequence, the decoder can only observe the first element; for the second, the second and the first; and for the nth element, the decoder can only observe elements (tokens) up to the nth element.

This mask is implemented as:


```python
self_attention_mask = (1 - torch.triu(torch.ones(1, sequence_length, sequence_length), diagonal=1)).bool()
```
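
To make the triangular pattern concrete, the following sketch builds the same mask for an assumed `sequence_length` of 4 and applies it to dummy attention scores. The `masked_fill` step is one common way to use such a mask and is an assumption about how the notebook consumes it.

```python
import torch

sequence_length = 4  # small value chosen for readability
self_attention_mask = (
    1 - torch.triu(torch.ones(1, sequence_length, sequence_length), diagonal=1)
).bool()

for row in self_attention_mask[0].int().tolist():
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# Dummy uniform scores: blocked positions receive zero weight after softmax
scores = torch.zeros(1, sequence_length, sequence_length)
weights = torch.softmax(scores.masked_fill(~self_attention_mask, float("-inf")), dim=-1)
```

Row `n` of the mask lists which positions token `n` may attend to, so the first token sees only itself and the last token sees the whole sequence.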



## 4. Why do we use split heads in the attention mechanism?

Split heads are used in the attention mechanism to allow the model to focus on different representational subspaces simultaneously. The notebook implements this in the `MultiHeadAttention` class:



```python
def split_heads(self, x, batch_size):
    # Split the sequence embeddings in x across the attention heads
    x = x.view(batch_size, -1, self.num_heads, self.head_dim)
    return x.permute(0, 2, 1, 3).contiguous().view(batch_size * self.num_heads, -1, self.head_dim)
```



Each head works on its own `head_dim`-sized slice of the projected embeddings, so different heads can learn different attention patterns over the same input. As described in the notebook:

> Multi-head attention calculated as:
> MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
> where head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)

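The reshaping can be checked on a small tensor. The sketch below uses standalone versions of `split_heads` and a hypothetical inverse, `combine_heads` (not shown in the notebook excerpt), to confirm that splitting only rearranges values and that each head sees a contiguous slice of the embedding.

```python
import torch

batch_size, seq_len, num_heads, head_dim = 2, 5, 4, 8
d_model = num_heads * head_dim

def split_heads(x):
    # (batch, seq, d_model) -> (batch * heads, seq, head_dim)
    x = x.view(batch_size, -1, num_heads, head_dim)
    return x.permute(0, 2, 1, 3).contiguous().view(batch_size * num_heads, -1, head_dim)

def combine_heads(x):
    # Hypothetical inverse: (batch * heads, seq, head_dim) -> (batch, seq, d_model)
    x = x.view(batch_size, num_heads, -1, head_dim)
    return x.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, num_heads * head_dim)

x = torch.randn(batch_size, seq_len, d_model)
heads = split_heads(x)            # shape (8, 5, 8)
restored = combine_heads(heads)   # identical to x
```

Here `heads[1]` holds embedding dimensions 8 to 15 of the first batch element, which is the slice that head 1 attends over.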
## 5. What is the difference between self-attention and cross-attention?

Based on the notebook:

- **Self-attention**: Involves computing attention where query, key, and value all come from the same sequence. Used in both encoders and decoders to allow tokens to attend to other tokens in the same sequence.

- **Cross-attention**: Used in the decoder layers of encoder-decoder transformers to attend to encoder outputs. The notebook states:
  > The second attention mechanism in the Decoder architecture is Encoder-Decoder attention, or cross-attention layer. The keys and values come from the output of the Encoder stack while queries come from the first self-attention layer of the Decoder stack. With this cross-attention, decoder can attend over all positions in the input.

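The difference shows up in the shapes of the attention weights. The sketch below uses PyTorch's built-in `torch.nn.MultiheadAttention` rather than the notebook's own class, and the tensor sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 16, 4
self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

decoder_states = torch.randn(2, 3, d_model)   # (batch, target_len, d_model)
encoder_output = torch.randn(2, 7, d_model)   # (batch, source_len, d_model)

# Self-attention: query, key, and value are the same sequence
self_out, self_weights = self_attn(decoder_states, decoder_states, decoder_states)

# Cross-attention: queries from the decoder, keys and values from the encoder
cross_out, cross_weights = cross_attn(decoder_states, encoder_output, encoder_output)

print(self_weights.shape)    # torch.Size([2, 3, 3])
print(cross_weights.shape)   # torch.Size([2, 3, 7])
```

In both cases the output keeps the query's length, while the weight matrix has one column per key position.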
## 6. Where exactly is the cross-attention mask applied in the vanilla transformer architecture?

In the vanilla transformer architecture, the cross-attention mask is applied inside each decoder layer, in the second attention sub-layer (encoder-decoder attention), where it masks the attention scores between decoder queries and encoder keys. From the notebook's `DecoderLayer` implementation:



```python
def forward(self, x, causal_mask, encoder_output, cross_mask):
    # Multi-head self-attention
    self_attn_output = self.self_attn(x, x, x, causal_mask)
    x = self.norm1(x + self.dropout(self_attn_output))
    # Cross-attention - note the cross_mask being applied here
    cross_attn_output = self.cross_attn(x, encoder_output, encoder_output, cross_mask)
    x = self.norm2(x + self.dropout(cross_attn_output))
    ff_output = self.feed_forward(x)
    x = self.norm3(x + self.dropout(ff_output))
    return x
```



The cross-attention mask is passed to the cross-attention mechanism to indicate which encoder positions each decoder position is allowed to attend to. In practice it is usually a padding mask that hides padded positions in the encoder input.
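
One common form of such a mask is a padding mask built from the source tokens. The sketch below assumes a hypothetical padding id `PAD_ID = 0` and the convention that `True` means "may attend"; the notebook's own `cross_mask` may be constructed differently.

```python
import torch

PAD_ID = 0  # hypothetical padding token id
source_tokens = torch.tensor([[5, 8, 3, PAD_ID, PAD_ID],
                              [7, 2, 9, 4, 6]])
target_len = 3

# True where the encoder position holds a real token
source_is_real = source_tokens != PAD_ID                              # (batch, source_len)
# Every decoder position gets the same view of the encoder positions
cross_mask = source_is_real.unsqueeze(1).expand(-1, target_len, -1)   # (batch, target_len, source_len)

# Dummy uniform scores: padded encoder positions receive zero weight
scores = torch.zeros(2, target_len, source_tokens.size(1))
weights = torch.softmax(scores.masked_fill(~cross_mask, float("-inf")), dim=-1)
```

For the first sequence, each decoder position spreads its attention over the three real source tokens and ignores the two padded ones.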

## 7. Which of these (Q, K, and V) is supplied from the Encoder to the cross-attention in the Encoder-Decoder transformer? And which from the decoder's attention?

In the Encoder-Decoder transformer's cross-attention mechanism:

- **From the Encoder**: Key (K) and Value (V) come from the encoder's output
- **From the Decoder**: Query (Q) comes from the decoder's self-attention layer output

This is implemented in the `DecoderLayer` class:


```python
cross_attn_output = self.cross_attn(x, encoder_output, encoder_output, cross_mask)
```



Where `x` (from the decoder) provides the queries, and `encoder_output` provides both keys and values.

## 8. Why are we shifting outputs to the right (in the vanilla Transformer architecture)?

The outputs are shifted one position to the right so that the decoder's input and its prediction target are offset by one token: at each position the decoder sees only earlier tokens and must predict the next one. In the notebook, this is demonstrated in Part 12:



```python
target_sequence = input_sequence.roll(1)
```



This shift creates a training scenario where the model learns to predict the next token in a sequence given the previous tokens, enabling autoregressive generation during inference.
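
For comparison, the sketch below shows a shift-right with an explicit start token next to `roll`. The `BOS_ID` value and the token ids are hypothetical. Note that `roll` wraps the last token around to the front, and when called without `dims` it flattens the tensor first, so on a batched tensor tokens can cross from one row into the next.

```python
import torch

BOS_ID = 1  # hypothetical start-of-sequence token id
target = torch.tensor([[11, 12, 13, 14],
                       [21, 22, 23, 24]])

# Shift right for teacher forcing: prepend BOS and drop the last token
bos = torch.full((target.size(0), 1), BOS_ID, dtype=target.dtype)
decoder_input = torch.cat([bos, target[:, :-1]], dim=1)
# decoder_input -> [[1, 11, 12, 13], [1, 21, 22, 23]]

# roll along the sequence dimension wraps the last token to the front
rolled = target.roll(1, dims=1)
# rolled -> [[14, 11, 12, 13], [24, 21, 22, 23]]

# roll without dims flattens first, so values move across batch rows
rolled_flat = target.roll(1)
# rolled_flat -> [[24, 11, 12, 13], [14, 21, 22, 23]]
```

With the explicit start token, position `i` of `decoder_input` lines up with position `i` of `target` as "previous token" and "token to predict".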

## 9. What can we do to improve the training process in Parts 6, 9, and 12?

Based on the notebook, several improvements could be made to the training processes:

- **Use real data**: Replace random sequences with actual task-specific data
- **Hyperparameter tuning**: Optimize model dimensions, learning rates, and dropout rates
- **Implement learning rate scheduling**: Utilize warmup and decay strategies
- **Add regularization techniques**: Weight decay, label smoothing, or weight tying (layer normalization is already part of each layer and mainly stabilizes training)
- **Use larger models**: Increase depth and width for more complex tasks
- **Implement more sophisticated optimization**: Try optimizers beyond plain Adam, for example AdamW, which decouples weight decay from the gradient update
- **Employ better initialization strategies**: For faster convergence
- **Add gradient clipping**: To handle exploding gradients
- **Implement early stopping**: To prevent overfitting
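
Two of these suggestions, learning rate warmup with decay and gradient clipping, can be sketched in a few lines. The schedule below follows the warmup formula from the original Transformer paper. The stand-in `nn.Linear` model, the dummy loss, the choice of AdamW, and all hyperparameter values are assumptions for illustration, not settings taken from the notebook.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the transformer model
# Base lr of 1.0 so the scheduler's factor becomes the actual learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

d_model, warmup_steps = 512, 4000

def noam_lr(step: int) -> float:
    """Linear warmup for warmup_steps, then inverse square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the scheduler's initial call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for _ in range(3):
    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()   # dummy loss
    optimizer.zero_grad()
    loss.backward()
    # Rescale gradients so their global norm does not exceed 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```

The learning rate rises linearly until `warmup_steps` and then decays, which tends to help during the unstable early phase of transformer training.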

## 10. What are some possible use-case scenarios for the different transformer types?

The notebook outlines specific use cases for each transformer variant:

**Encoder-only Transformer**:
- Text classification tasks
- Named entity recognition
- Document classification
- Sentiment analysis
- BERT is mentioned as a prominent example

**Decoder-only Transformer**:
- Text generation
- Story writing
- Code completion
- Chat completion
- GPT-3 is mentioned as a notable example

**Encoder-Decoder Transformer**:
- Machine translation
- Text summarization
- Question answering
- Data-to-text generation
- T5 (Text-to-Text Transfer Transformer) is mentioned as an example
