**Q.4** What Does Depth Buy Us?

You will implement a mini Transformer encoder with a variable number of self-attention layers (e.g., 1, 2, or 4). Each layer consists of:
1. A single-head self-attention mechanism
2. A simple feedforward layer
3. Residual connections (optional but encouraged)
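
For reference, one such layer can be written compactly. This is a sketch that assumes the standard scaled dot-product formulation with residual connections and no layer normalisation. The symbols $X$, $W_Q$, $W_K$, $W_V$, $A$, $H$, $Y$ and $\operatorname{FFN}$ are notation introduced here for illustration, not something the task prescribes.

```latex
\begin{aligned}
Q &= X W_Q, \qquad K = X W_K, \qquad V = X W_V \\
A &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{\text{model}}}}\right) \\
H &= X + A V \\
Y &= H + \operatorname{FFN}(H)
\end{aligned}
```

Row $i$ of $A$ is the attention distribution of token $i$ over all positions, which is the quantity the visualisation tasks track from layer to layer.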

**Task**:
- Implement the encoder from scratch in PyTorch.
- Create a toy input sequence of token embeddings (e.g., a fixed random tensor of shape [seq_len, d_model]).
- Pass the same input sequence through 1-layer, 2-layer, and 4-layer self-attention stacks.
- Visualize the attention weights for one token (e.g., the middle token) across all layers.
- For one token (e.g., the middle token), plot how its attention distribution evolves with depth.
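
One possible way to do the visualisation step is a heatmap with one row per layer. The sketch below assumes the encoder hands back a list with one `[seq_len, seq_len]` attention matrix per layer. The helper names `rows_for_token` and `plot_attention_by_depth` are hypothetical, and the input is stand-in random data rather than real model output.

```python
import torch
import matplotlib.pyplot as plt


def rows_for_token(attn_per_layer, token_idx):
    """Stack one token's attention row from every layer into [num_layers, seq_len]."""
    return torch.stack([a[token_idx] for a in attn_per_layer]).detach()


def plot_attention_by_depth(attn_per_layer, token_idx):
    """Heatmap with one row per layer and one column per attended-to position."""
    rows = rows_for_token(attn_per_layer, token_idx)
    fig, ax = plt.subplots(figsize=(6, 3))
    im = ax.imshow(rows.numpy(), aspect="auto", cmap="viridis")
    ax.set_xlabel("attended-to position")
    ax.set_ylabel("layer")
    ax.set_yticks(range(rows.shape[0]))
    ax.set_yticklabels([str(i + 1) for i in range(rows.shape[0])])
    ax.set_title(f"Attention of token {token_idx} across layers")
    fig.colorbar(im, ax=ax, label="attention weight")
    return fig


# Stand-in data (assumption): random softmax-normalised matrices in place of
# the per-layer attention weights a real encoder would return.
seq_len, num_layers = 10, 4
fake_attn = [torch.softmax(torch.randn(seq_len, seq_len), dim=-1)
             for _ in range(num_layers)]
fig = plot_attention_by_depth(fake_attn, token_idx=seq_len // 2)
```

In a notebook the returned figure renders at the end of the cell. Swapping `fake_attn` for the encoder's own per-layer weights gives the depth comparison the task asks for.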

**In your own words, explain**:
- Does deeper attention "sharpen", "blur", or "redirect" focus?
- How is meaning transformed across layers?
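
To back the "sharpen" versus "blur" answer with a number, one option is the Shannon entropy of the token's attention row: low entropy means focused attention, high entropy means diffuse attention. This is a minimal sketch. The helper `attention_entropy` is a hypothetical name, and the two example rows are hand-made extremes, not model output.

```python
import torch


def attention_entropy(attn_row, eps=1e-12):
    """Shannon entropy (in nats) of one attention distribution."""
    p = attn_row.detach()
    # eps keeps log() finite where a weight is exactly zero
    return float(-(p * (p + eps).log()).sum())


seq_len = 10
uniform = torch.full((seq_len,), 1.0 / seq_len)  # maximally "blurred" row
one_hot = torch.zeros(seq_len)
one_hot[3] = 1.0                                 # maximally "sharp" row

print(attention_entropy(uniform))  # approximately log(10), about 2.30
print(attention_entropy(one_hot))  # approximately 0
```

Comparing this value for the same token at each layer shows whether focus tightens or spreads with depth, while comparing `attn_row.argmax()` across layers indicates whether the focus is redirected to a different position.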

> You are free to use a fixed positional encoding (e.g., sinusoidal) or none at all. The goal is to study qualitative changes in attention with increasing depth.
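
If you opt for a fixed positional encoding, the sinusoidal variant can be sketched as follows. The function name `sinusoidal_positional_encoding` is hypothetical, `d_model` is assumed to be even, and the result would simply be added to the toy embeddings before the first layer.

```python
import math
import torch


def sinusoidal_positional_encoding(seq_length, d_model):
    """Fixed sinusoidal encoding of shape [seq_length, d_model] (d_model assumed even)."""
    position = torch.arange(seq_length, dtype=torch.float32).unsqueeze(1)  # [seq_length, 1]
    # One frequency per pair of dimensions: 10000^(-2i / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe


pe = sinusoidal_positional_encoding(seq_length=10, d_model=64)
print(pe.shape)  # torch.Size([10, 64])
```

Because the encoding is deterministic, the same `pe` can be reused for the 1-, 2-, and 4-layer runs, so any difference in attention comes from depth alone.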

```python
# Suggested structure
class SelfAttentionLayer(nn.Module):
    ...

class TransformerEncoder(nn.Module):
    def __init__(self, num_layers):
        ...

# Use random or fixed embeddings for a toy example input sequence
```

In [1]:
import torch
import numpy as np
import torch.nn as nn
import math
import matplotlib.pyplot as plt

In [2]:
seq_length = 10
d_model = 64

# Toy input: a fixed random tensor of token embeddings, shape [seq_length, d_model]
toy_input_seq = torch.randn(seq_length, d_model)
print(toy_input_seq.shape)
print(toy_input_seq)

torch.Size([10, 64])
tensor([[-1.6114e-01, -1.3445e+00, -1.4474e+00,  2.0271e+00,  6.6079e-01,
          5.4417e-01, -1.6611e-01,  9.2867e-02, -1.4756e+00,  7.7923e-01,
          9.2904e-01, -1.2588e-01,  1.5676e+00,  9.5272e-01,  5.4230e-01,
         -2.4356e-01, -1.2689e+00,  1.7080e-01,  1.3478e+00,  6.2493e-01,
         -1.1023e+00,  1.6912e-01, -7.8394e-01,  2.8224e+00,  2.2944e+00,
          3.9569e-01,  2.3372e+00, -7.2517e-02,  5.3807e-01, -2.7931e-01,
          1.2905e-01,  9.1313e-01, -5.1254e-01,  6.0260e-01, -1.1700e-01,
         -2.8920e-03,  1.0927e+00, -4.7146e-01,  2.9892e+00,  1.1947e-01,
          5.4790e-01, -1.8880e+00, -1.5808e+00,  1.7201e+00, -8.7866e-01,
         -1.1480e+00,  9.2404e-02, -1.4438e-01,  5.5407e-02,  6.5202e-02,
          1.9071e+00, -4.5858e-01,  8.3533e-01,  2.7186e-01,  1.2193e+00,
          2.2775e-01,  1.7512e-01, -1.3678e+00, -6.9906e-01,  3.8462e-03,
         -6.2365e-02, -1.4455e-03,  1.7665e+00,  8.4656e-01],
        [ 5.4657e-01, -2.0432

In [3]:
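class SelfAttention_Feedforward_normalisation(nn.Module):
  """One encoder layer: single-head self-attention, then a feedforward block,
  each wrapped in a residual connection followed by LayerNorm."""

  def __init__(self, d_model):
    super().__init__()

    self.d_model = d_model
    # Query, key, and value projections for single-head attention
    self.w_query = nn.Linear(d_model, d_model)
    self.w_key = nn.Linear(d_model, d_model)
    self.w_value = nn.Linear(d_model, d_model)

    # Position-wise feedforward block
    self.l1 = nn.Linear(d_model, d_model)
    self.relu = nn.ReLU()
    self.l2 = nn.Linear(d_model, d_model)

    # Separate LayerNorms so the two sub-layers do not share scale/shift parameters
    self.layer_norm1 = nn.LayerNorm(d_model)
    self.layer_norm2 = nn.LayerNorm(d_model)

  def forward(self, x):
    query_vector = self.w_query(x)  # X W_q
    key_vector = self.w_key(x)      # X W_k
    value_vector = self.w_value(x)  # X W_v

    # Scaled dot-product scores, shape [seq_len, seq_len]
    E = torch.matmul(query_vector, key_vector.transpose(-2, -1)) / math.sqrt(self.d_model)
    attention_weight = torch.softmax(E, dim=-1)  # each row sums to 1
    output_vec = torch.matmul(attention_weight, value_vector)

    # Residual connection + LayerNorm around the attention sub-layer
    norm1 = self.layer_norm1(x + output_vec)

    out = self.l1(norm1)
    out = self.relu(out)
    out = self.l2(out)

    # Residual connection + LayerNorm around the feedforward sub-layer
    norm2 = self.layer_norm2(out + norm1)

    return norm2, attention_weight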


In [4]:
class TransformerEncoder(nn.Module):
  def __init__(self, d_model, num_layers):
    super(TransformerEncoder, self).__init__()
    self.layer= nn.ModuleList([
            SelfAttention_Feedforward_normalisation(d_model)
            for i in range(num_layers)
        ])

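  def forward(self, x):
    attn_weights_all = []
    for layer in self.layer:
      x, attention_weight = layer(x)
      attn_weights_all.append(attention_weight)
    # Return only after every layer has run, with one attention matrix per layer
    return x, attn_weights_all


In [5]:
num_layer = [1, 2, 4]
attentions = {}
results = {}

In [6]:
for i in num_layer:
  model = TransformerEncoder(d_model=d_model, num_layers=i)
  out, attn = model(toy_input_seq)  # attn: list of [seq_length, seq_length] matrices
  results[f"{i} layer"] = out
  # Last layer's attention row for the last token (a distribution over all positions)
  attentions[f"{i} layer"] = attn[-1][-1]
  print(attentions)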

{'1 layer': tensor([0.1431, 0.1340, 0.0672, 0.0862, 0.1688, 0.0989, 0.0730, 0.0633, 0.0641,
        0.1012], grad_fn=<SelectBackward0>)}
{'1 layer': tensor([0.1431, 0.1340, 0.0672, 0.0862, 0.1688, 0.0989, 0.0730, 0.0633, 0.0641,
        0.1012], grad_fn=<SelectBackward0>), '2 layer': tensor([0.0650, 0.0908, 0.0358, 0.1028, 0.0945, 0.1209, 0.2061, 0.0972, 0.1457,
        0.0411], grad_fn=<SelectBackward0>)}
{'1 layer': tensor([0.1431, 0.1340, 0.0672, 0.0862, 0.1688, 0.0989, 0.0730, 0.0633, 0.0641,
        0.1012], grad_fn=<SelectBackward0>), '2 layer': tensor([0.0650, 0.0908, 0.0358, 0.1028, 0.0945, 0.1209, 0.2061, 0.0972, 0.1457,
        0.0411], grad_fn=<SelectBackward0>), '4 layer': tensor([0.0717, 0.1297, 0.0829, 0.0996, 0.0880, 0.0686, 0.0978, 0.0716, 0.1494,
        0.1406], grad_fn=<SelectBackward0>)}
