<a href="https://colab.research.google.com/github/mrigakshipandey/seminar-mikroelektronik/blob/main/Task_4_LLaMa_2_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers accelerate huggingface_hub

## Log in to Hugging Face

In [None]:
from huggingface_hub import login

login()  # Paste your Hugging Face token here (with model access)


## Model

In [None]:
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

## Model Architecture Summary


| Parameter                 | Description                                                                                              | Why It Matters                                                                                   |
| ------------------------- | -------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| `hidden_size`             | Dimensionality of the embeddings and hidden layers (e.g. 4096)                                           | Controls how much information the model can store per token                                      |
| `num_hidden_layers`       | Total number of Transformer blocks (e.g. 32 for LLaMA-2 7B)                                              | Deeper models generally have more capacity                                                       |
| `num_attention_heads`     | Number of attention heads per layer                                                                      | Allows the model to attend to different subspaces of the input                                   |
| `num_key_value_heads`     | Number of key/value heads; when smaller than `num_attention_heads`, KV heads are shared (grouped-query attention, GQA) | Reduces KV-cache memory and speeds up inference; equal to `num_attention_heads` for plain MHA |
| `vocab_size`              | Total number of unique tokens the model understands                                                      | A larger vocabulary supports more languages and rare words                                       |
| `max_position_embeddings` | Maximum context length (number of tokens) the model can attend to                                        | Limits how long the input/output sequences can be                                                |
| `rope_theta`              | Base frequency of the rotary positional embedding (RoPE)                                                 | Affects how attention handles long-range dependencies                                            |
| `hidden_dropout_prob`     | Dropout probability in the hidden layers (may be absent from the config)                                 | Regularization to prevent overfitting (not active during inference)                              |
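
As a rough illustration of how these fields relate, the sketch below derives the per-head dimension and the number of query heads that share one KV head. The helper name `derived_dims` is hypothetical, and the numbers passed in are assumed values (hidden size 4096 with 32 attention heads and 32 KV heads, as described for the 7B model in this notebook), not values read from the loaded config.

```python
def derived_dims(hidden_size, num_attention_heads, num_key_value_heads):
    """Derive per-head quantities from the config fields in the table above."""
    # Each attention head works on an equal slice of the hidden dimension.
    head_dim = hidden_size // num_attention_heads
    # Number of query heads that share one key/value head (1 means plain MHA).
    queries_per_kv = num_attention_heads // num_key_value_heads
    return head_dim, queries_per_kv


# Assumed 7B-style values; replace them with the fields of model.config if available.
head_dim, queries_per_kv = derived_dims(4096, 32, 32)
print("head_dim:", head_dim)
print("query heads per KV head:", queries_per_kv)
```

A ratio greater than 1 would indicate grouped-query attention, where several query heads read the same keys and values.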




In [None]:
config = model.config

print("Model Architecture Summary")
print("-" * 40)
print(f"Model Type        : {config.model_type}")
print(f"Hidden Size       : {config.hidden_size}")
print(f"Num of Layers     : {config.num_hidden_layers}")
print(f"Attention Heads   : {config.num_attention_heads}")
print(f"KV Heads          : {getattr(config, 'num_key_value_heads', 'N/A')}")
print(f"Vocab Size        : {config.vocab_size}")
print(f"Max Context (Seq) : {getattr(config, 'max_position_embeddings', 'N/A')}")
print(f"RoPE Theta (Base) : {getattr(config, 'rope_theta', 'N/A')}")
print(f"Dropout Prob.     : {getattr(config, 'hidden_dropout_prob', 'N/A')}")


## The General Architecture of LLaMa





![llama2](https://miro.medium.com/1*CQs4ceLpN8tIN8QyezL2Ag.png)

According to the [source code](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py), the model performs the following operations:

- LlamaForCausalLM
  - **Embedding**
  - LlamaDecoderLayer * num_hidden_layers
    - **LlamaRMSNorm** (input_layernorm)
    - **LlamaAttention**
      - **Linear projections for Q, K, V**
      - Reshape for multi-head attention
      - **RoPE**
      - Update KV Cache
      - **Attention**
        - Compute Score for every query-key pair
        - scaling
        - attention_mask
        - Softmax
        - dropout
        - multiply with value
        - Reshape
      - Linear Projection for O
    - residual add
    - LlamaRMSNorm (post_attention_layernorm)
    - LlamaMLP
    - residual add
  - LlamaRMSNorm
  - Linear

## Input

In [None]:
from transformers import AutoTokenizer

# Use the same repo as your model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Tokenize input text
text = "Hello LLM User"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
input_ids = inputs["input_ids"]

print("Token IDs:", input_ids)
print("Decoded back:", tokenizer.decode(input_ids[0]))

The input tensor has dimensions (1, 5),

where 1 is the batch size and 5 is the number of tokens in the input sequence.

## Embedding Layer
Each token in the input sequence is part of a predefined vocabulary. (This model has a vocab_size of 32000.)

We use the position of the token in the vocabulary to perform a lookup operation in the embedding matrix.

In [None]:
embed_layer = model.model.embed_tokens
print("Embedding matrix shape:", embed_layer.weight.shape)
print("- Vocab size:", embed_layer.weight.shape[0])
print("- Hidden size:", embed_layer.weight.shape[1])
print("Note: Dtype:", embed_layer.weight.dtype)

![image.png](https://www.3blue1brown.com/content/lessons/2024/gpt/token.png)

---
---
### Embedding: Description
Embedding can be understood to carry the meaning of the token.

Given our input dimension (1, 5).

And the embedding matrix shape (32000, 4096)

After Embedding we'll get a tensor with dimensions (1, 5, 4096).
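
The lookup described above can be sketched in plain Python. The sizes here are toy values (a vocabulary of 10 and a hidden size of 4 instead of 32000 and 4096), and the token IDs are made up for illustration; the point is only how a (1, 5) tensor of IDs becomes a (1, 5, hidden) tensor.

```python
import random

# Toy sizes for illustration; the real model uses vocab_size=32000, hidden_size=4096.
vocab_size, hidden_size = 10, 4
random.seed(0)
embedding_matrix = [
    [random.random() for _ in range(hidden_size)] for _ in range(vocab_size)
]

# Shape (1, 5): a batch of one sequence with five (made-up) token IDs.
input_ids = [[1, 7, 3, 3, 9]]

# Lookup: every token ID selects one row of the embedding matrix.
embedded = [[embedding_matrix[tok] for tok in seq] for seq in input_ids]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print("Embedded shape:", shape)
```

Note that the two occurrences of token ID 3 map to the same row, which is why positional information has to be added separately.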

---
Additional Notes
- In the Original Transformer Architecture, this was followed up by an **Absolute positional Encoding**.

- In the LLaMa Model, we use **Rotary Positional Encoding** later with the Query and Key Matrix.


---
---
### Embedding Matrix: Static Memory Requirements
Number of Parameters
- 32000 * 4096 = 131,072,000

Therefore, the memory requirement in float16 (2 bytes per parameter) is
- 32000 * 4096 * (16/8) = 262,144,000 bytes (about 262 MB, or exactly 250 MiB)

In [None]:
def get_embedding_static_memory():
    module = model.model.embed_tokens

    # Count parameters and their actual storage size
    # (element_size() is 2 bytes for a model loaded in fp16).
    num_params = sum(p.numel() for p in module.parameters())
    param_bytes = sum(p.numel() * p.element_size() for p in module.parameters())

    print("\nEmbedding parameters:", num_params)
    print("Embedding memory: {:.2f} MiB".format(param_bytes / (1024 * 1024)))

get_embedding_static_memory()

---
---
### Embedding: Operations per Token
This is a lookup operation, so almost no FLOPs are performed.

---
---
### Embedding Forward Pass: Accessed Bytes
To fetch the embeddings for one batch with sequence length seq:
- 1 * seq * 4096 * (16/8) bytes

For the longest allowable sequence length (4096)
- 1 * 4096 * 4096 * (16/8) = 33554432 (over 33 MB)

---
---

## RMS Norm
Without normalisation, the magnitude of activations can explode or vanish as depth increases. Normalisation (LayerNorm, RMSNorm, etc.) therefore helps deep transformers like LLaMA train more stably and converge faster.

LayerNorm both re-centers and re-scales the activations. RMSNorm, in contrast, rests on the hypothesis that re-scaling, not re-centering, is the main reason normalisation works.

- RMSNorm is popular because it requires fewer computations.

Source code:
```
def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```
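
To make the formula concrete, here is a minimal pure-Python sketch of the same computation for a single token vector. The function name `rms_norm` and the `eps` default are assumptions for illustration (the real value comes from the model config), and the float32 up-cast is omitted because Python floats are already double precision.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one token vector: x / sqrt(mean(x^2) + eps) * weight."""
    # Mean of the squared elements (called "variance" in the source code,
    # although no mean is subtracted).
    mean_sq = sum(v * v for v in x) / len(x)
    # Reciprocal square root, matching torch.rsqrt(variance + eps).
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Rescale, then apply the learnable per-feature weight.
    return [w * v * inv_rms for v, w in zip(x, weight)]


x = [3.0, -4.0, 0.0, 0.0]
y = rms_norm(x, weight=[1.0] * len(x))
print(y)
```

With a weight of all ones, the output has a root mean square close to 1, which is the rescaling effect described above.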

In [None]:
# Example: take the RMSNorm before the first self-attention
rmsnorm = model.model.layers[0].input_layernorm

# 1) The learnable scale parameter (w)
print("Weight shape:", rmsnorm.weight.shape)   # (hidden_size,)

# 2) The epsilon value (fixed, not a Parameter)
print("Epsilon:", rmsnorm.variance_epsilon)

In [None]:
from torch.fx import symbolic_trace
from transformers.models.llama.modeling_llama import LlamaRMSNorm

# Define the module
hidden_dim = 4096
rms = LlamaRMSNorm(hidden_dim)

# Trace the computation graph
rms_traced = symbolic_trace(rms)

# Inspect individual nodes
for node in rms_traced.graph.nodes:
    print(node.op, node.target, node.args)


---
---
### RMS Norm: Operations Breakdown

- The input tensor x
1. Cast input to float32 (even if model runs in fp16)
2. Square every element
3. Compute mean across the hidden dimension
4. Add epsilon
5. Reciprocal square root
6. Multiply with x
7. Retrieve learnable scaling parameter w
8. Get dtype of original input
9. Cast result back to original dtype
10. Multiply elementwise by learnable weight w.
- RMSNorm output

---
---
### RMS Norm: Static Memory Requirements


In [None]:
from transformers.models.llama.modeling_llama import LlamaRMSNorm
def get_rmsnorm_static_memory():
    hidden_dim = 4096
    module = LlamaRMSNorm(hidden_dim)

    # A freshly constructed module holds fp32 weights by default, so the fp16
    # size is derived from the parameter count (2 bytes per parameter).
    num_params = sum(p.numel() for p in module.parameters())

    print("\nRMSNorm parameters:", num_params)
    print("RMSNorm memory: {:.2f} KiB".format(num_params * 2 / 1024))  # 2 bytes per fp16 param

get_rmsnorm_static_memory()

---
---
### RMS Norm: Operations
Let:

batch = B, seq = S, hidden = D

N = B·S·D

The FLOP counts below are per token, i.e. for one row of D elements; multiply by B·S for the whole tensor.

1. `Cast:` x.to(float32)

    FLOPs: 0

2. `Square:` pow(x32, 2)

    FLOPs: D (multiplications)

3. `Mean:` mean(x32^2, dim=-1, keepdim=True) → shape (B,S,1)

    FLOPs: D-1 (additions) + 1 (division: hardware dependent)
    = D

4. `Add epsilon:` add(eps) on (B,S,1)

    FLOPs: 1 (addition)

5. `Reciprocal square root:`  rsqrt on (B,S,1)

    FLOPs: 1 (sqrt: hardware dependent) + 1 (reciprocal: hardware dependent)

6. `Multiply with x:` mul(x32, inv_r) (broadcast (B,S,1) → (B,S,D))

    FLOPs: D

7. `Retrieve w:` get_attr(weight) (γ, FP32)

    FLOPs: 0

8. `Get dtype of original input`

    FLOPs: 0

9. `Cast to original dtype:` y32.to(x.dtype) (back to input dtype)

    FLOPs: 0

10. `Multiply elementwise:` mul(y, weight) (broadcast γ over batch/seq)

    FLOPs: D
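
Summing the ten steps gives roughly 4·D + 3 FLOPs per token. The sketch below simply adds up the per-step counts from the list; the function name is hypothetical, and the costs of the division, square root and reciprocal are counted as one FLOP each, which is an assumption because they are hardware dependent.

```python
def rmsnorm_flops_per_token(hidden_size):
    """Add up the per-token FLOP counts of the RMSNorm steps listed above."""
    square = hidden_size      # step 2: D multiplications
    mean = hidden_size        # step 3: D-1 additions + 1 division
    add_eps = 1               # step 4: one addition
    rsqrt = 2                 # step 5: sqrt + reciprocal (assumed 1 FLOP each)
    scale = hidden_size       # step 6: multiply x by the inverse RMS
    apply_weight = hidden_size  # step 10: multiply by the learnable weight
    # Steps 1, 7, 8 and 9 (casts and attribute lookups) contribute 0 FLOPs.
    return square + mean + add_eps + rsqrt + scale + apply_weight


per_token = rmsnorm_flops_per_token(4096)
print("FLOPs per token:", per_token)
print("FLOPs for a 5-token sequence:", 5 * per_token)
```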



---
---
### RMS Norm: Accessed Bytes
Let:

batch = B, seq = S, hidden = D

N = B·S·D

s_in = bytes per element of the input dtype (2 for fp16), s32 = bytes per float32 element (4)

1. `Cast:` x.to(float32)

    Reads: N · s_in

    Writes: N · s32

2. `Square:` pow(x32, 2)

    Reads: N · s32

    Writes: N · s32

3. `Mean:` mean(x32^2, dim=-1, keepdim=True) → shape (B,S,1)

    Reads: N · s32

    Writes: (B·S) · s32

4. `Add epsilon:` add(eps) on (B,S,1)

    Reads: (B·S) · s32

    Writes: (B·S) · s32

5. `Reciprocal square root:`  rsqrt on (B,S,1)

    Reads: (B·S) · s32

    Writes: (B·S) · s32

6. `Multiply with x:` mul(x32, inv_r) (broadcast (B,S,1) → (B,S,D))

    Reads: N · s32 + (B·S) · s32

    Writes: N · s32

7. `Retrieve w:` get_attr(weight) (γ, FP32)

    Reads: D · 4

    Writes: 0

8. `Get dtype of original input`

    Reads: 0

    Writes: 0

9. `Cast to original dtype:` y32.to(x.dtype) (back to input dtype)

    Reads: N · s32

    Writes: N · s_in

10. `Multiply elementwise:` mul(y, weight) (broadcast γ over batch/seq)

    Reads: N · s_in + D · 4

    Writes: N · s_in
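
The per-step counts above can be totalled with a small helper. This is a sketch that assumes every step reads from and writes to memory separately (no kernel fusion), so it is an upper bound for an eager, unfused execution; the function and parameter names are illustrative only.

```python
def rmsnorm_bytes(B, S, D, s_in=2, s32=4, s_w=4):
    """Total bytes read and written by the unfused RMSNorm steps listed above."""
    N = B * S * D   # elements in the full activation tensor
    T = B * S       # number of tokens (one statistic per token)

    reads = (
        N * s_in                # 1. cast: read input
        + N * s32               # 2. square
        + N * s32               # 3. mean
        + T * s32               # 4. add epsilon
        + T * s32               # 5. rsqrt
        + N * s32 + T * s32     # 6. multiply with x
        + D * s_w               # 7. retrieve weight
        + N * s32               # 9. cast back
        + N * s_in + D * s_w    # 10. multiply by weight
    )
    writes = (
        N * s32                 # 1. cast
        + N * s32               # 2. square
        + T * s32               # 3. mean
        + T * s32               # 4. add epsilon
        + T * s32               # 5. rsqrt
        + N * s32               # 6. multiply with x
        + N * s_in              # 9. cast back
        + N * s_in              # 10. multiply by weight
    )
    return reads, writes


reads, writes = rmsnorm_bytes(B=1, S=1, D=4096)
print("Bytes read per token   :", reads)
print("Bytes written per token:", writes)
```

A fused kernel would touch far fewer bytes, since the intermediate float32 tensors never need to leave on-chip memory.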



## Q K V O Linear Projections
Multiply the input tensor with the learned matrices $W^Q$, $W^K$ and $W^V$.

For this model, these matrices have the dimension (4096, 4096).

After the linear projections, the resulting $Q$, $K$ and $V$ matrices have the same dimensions as the embedded input tensor.

Source Code:
```
self.q_proj = nn.Linear(
            config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
)
self.k_proj = nn.Linear(
            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
)
self.v_proj = nn.Linear(
            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
)
self.o_proj = nn.Linear(
            config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
)
```

This means each projection is a matrix multiplication between the hidden states and the layer's weight matrix.

In [None]:
block = model.model.layers[0]  # Pick any layer, e.g., layer 0

q_proj = block.self_attn.q_proj.weight
k_proj = block.self_attn.k_proj.weight
v_proj = block.self_attn.v_proj.weight
o_proj = block.self_attn.o_proj.weight

print("Query Projection (Q)   :", q_proj.shape)
print("Key Projection (K)     :", k_proj.shape)
print("Value Projection (V)   :", v_proj.shape)
print("Output Projection (O)  :", o_proj.shape)

---
---
### Linear Projection: Static Memory Requirements

Weights = 4096 × 4096 = 16,777,216 parameters

Bias (only if `config.attention_bias` is enabled) = 4096 parameters

Total parameters = ~16.78M

Memory in FP16 (2 bytes per param) = 33,554,432 bytes = 32 MiB (about 33.6 MB)

---
---
### Linear Projection: Operations per Token
Operations per token for one linear layer:
- For weights
  - Multiplications: 4096 * 4096
  - Additions: 4096 * (4096 - 1)
- For bias
  - Additions: 4096

Total number of operations per token (with bias): 2 * 4096 * 4096

= 33,554,432
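
The count above generalises to any layer size. The helper below is a small sketch (the name `linear_flops_per_token` is made up for this notebook) that reproduces the arithmetic with and without a bias term.

```python
def linear_flops_per_token(in_features, out_features, bias=False):
    """FLOPs for y = W x (+ b) applied to a single token vector."""
    # Each of the out_features outputs is a dot product of length in_features.
    multiplications = in_features * out_features
    additions = out_features * (in_features - 1)
    if bias:
        additions += out_features  # one extra addition per output element
    return multiplications + additions


with_bias = linear_flops_per_token(4096, 4096, bias=True)
without_bias = linear_flops_per_token(4096, 4096, bias=False)
print("With bias   :", with_bias)
print("Without bias:", without_bias)
```

For four projections (Q, K, V, O) of this size, the per-token cost is simply four times the value printed.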

---
---
### Linear Projection: Accessed Bytes

READ:
1. Input vector x:

  - Shape = (in_features,) = (4096,)

  - Size = 4096 × dtype_size

  - In FP16 → 4096 × 2B = 8 KB

2. Weight matrix W:

  - Shape = (out_features, in_features) = (4096, 4096)

  - Size = 16,777,216 × 2B = 32 MB

3. Bias vector b (if present):

  - Shape = (out_features,) = (4096,)

  - Size = 4096 × 2B = 8 KB

WRITE:

1. Output vector y:

  - Shape = (out_features,) = (4096,)

  - Size = 4096 × 2B = 8 KB

## RoPE (Rotary Positional Encoding)

RoPE is applied to the $Q$ and $K$ matrices.

  ![RoPE](https://miro.medium.com/v2/resize:fit:1400/1*jkoR140vi3LncTrVmvdsDQ.png)

The cos and sin tensors are precomputed once and reused.
- Multiplying by them rotates the query and key vectors so that they carry positional information.

Source Code:
```
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
```
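
The following pure-Python sketch applies the same formula to a single head vector. It assumes the "split in half" layout used by `rotate_half` in the linked source and a base of 10000 (an assumed default for `rope_theta`); the helper names `rope_angles` and `apply_rope` are illustrative.

```python
import math


def rotate_half(x):
    """Return (-x2, x1), where x1 and x2 are the first and second halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope_angles(position, head_dim, base=10000.0):
    """cos/sin vectors for one position; the frequencies are repeated for both halves."""
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    angles = [position * f for f in inv_freq]
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    return cos, sin


def apply_rope(x, cos, sin):
    """x_embed = x * cos + rotate_half(x) * sin, element by element."""
    return [v * c + r * s for v, c, r, s in zip(x, cos, rotate_half(x), sin)]


q = [1.0, 2.0, 3.0, 4.0]
cos0, sin0 = rope_angles(position=0, head_dim=4)
cos3, sin3 = rope_angles(position=3, head_dim=4)
q_pos0 = apply_rope(q, cos0, sin0)
q_pos3 = apply_rope(q, cos3, sin3)
print("position 0:", q_pos0)
print("position 3:", q_pos3)
```

Position 0 leaves the vector unchanged, and any other position rotates pairs of elements without changing the vector's length.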

**Multi Head Attention (MHA)**

  ![G-MQA](https://www.mdpi.com/pharmaceuticals/pharmaceuticals-17-01300/article_deploy/html/images/pharmaceuticals-17-01300-g0A3.png)


The 7B version uses MHA, while larger versions of the model (such as 70B) use GQA.


The query tensor is split equally among the attention heads along the hidden (embedding) dimension.

$Q$ dimension = (1, 5, 4096)

Split between 32 attention heads, each head gets an input tensor of dimension (1, 5, 128).

$K$ and $V$ are split in the same way.

We apply self-attention in each head and finally concatenate the outputs.
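
The split and the final concatenation can be sketched with nested lists. Toy sizes are used here (hidden size 8 and 2 heads instead of 4096 and 32), the batch dimension is dropped, and the helper names `split_heads` and `merge_heads` are made up for illustration.

```python
def split_heads(x, num_heads):
    """(seq, hidden) -> (num_heads, seq, head_dim) by slicing the hidden dimension."""
    head_dim = len(x[0]) // num_heads
    return [
        [row[h * head_dim:(h + 1) * head_dim] for row in x]
        for h in range(num_heads)
    ]


def merge_heads(heads):
    """(num_heads, seq, head_dim) -> (seq, hidden) by concatenating per-token slices."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]


# Toy input: 5 tokens with hidden size 8 (the real model uses 4096 and 32 heads).
x = [[float(t * 8 + d) for d in range(8)] for t in range(5)]
heads = split_heads(x, num_heads=2)
print("heads:", len(heads), "seq:", len(heads[0]), "head_dim:", len(heads[0][0]))

# Concatenating the heads restores the original layout.
restored = merge_heads(heads)
print("round trip ok:", restored == x)
```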



In [None]:
config = model.config

print(f"Attention Heads   : {config.num_attention_heads}")
print(f"KV Heads          : {getattr(config, 'num_key_value_heads', 'N/A')}")

---
---
### RoPE: Static Memory Requirements


In [None]:
# Tokenize input text
text = "Hello LLM User"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
input_ids = inputs["input_ids"]

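# Get rotary embedding module and generate sin, cos
seq_len = input_ids.shape[1]
rotary_emb = model.model.rotary_emb
position_ids = torch.arange(seq_len, dtype=torch.long, device=input_ids.device).unsqueeze(0)
# Pass a floating-point tensor as x: in recent transformers versions the module uses x
# only for its dtype and device, so the integer input_ids would cast cos/sin to integers.
hidden_states = model.model.embed_tokens(input_ids)
cos, sin = rotary_emb(x=hidden_states, position_ids=position_ids)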

print("cos shape:", cos.shape)
print("sin shape:", sin.shape)



Typically the implementation stores two buffers:

cos: shape (max_seq_len, rotary_dim)

sin: shape (max_seq_len, rotary_dim)

These are stored once and broadcast at runtime.

Some implementations keep a shaped/broadcast view such as (1, seq_len, 1, head_dim), similar to the example above, but that is simply a slice of the full table.

Total elements = 2 * max_seq_len * rotary_dim

Bytes = 2 * max_seq_len * rotary_dim * dtype_size

= 2 * 4096 * 128 * 2 = 2,097,152 bytes = 2 MiB

---
---
### RoPE: Operations
According to the source code, we perform 3 operations per element: two multiplications (by cos and by sin) and one addition.


---
---
### RoPE: Accessed Bytes
Per token, per head (head_dim = 128, fp16):

Read Q: 128 * 2 B

Read K: 128 * 2 B

Read cos/sin for the position: 2 * 128 * 2 B, but these are shared across heads, so the per-head share ≈ (2 * 128 * 2) / 32 B

Write rotated Q, K: 2 * 128 * 2 B

## Attention
![](https://miro.medium.com/v2/resize:fit:893/1*BKsxsnDbIM7eb_dAtws3Yg.png)

Source Code:
```
    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
    if attention_mask is not None:
        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
        attn_weights = attn_weights + causal_mask

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value_states)
    attn_output = attn_output.transpose(1, 2).contiguous()
```
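
As a rough companion to the source above, this pure-Python sketch runs the same steps (scores, scaling, causal mask, softmax, weighted sum of values) for a single head. It assumes inference, so dropout is skipped, and the toy inputs and function names are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head attention; q, k, v have shape (seq, head_dim)."""
    scaling = 1.0 / math.sqrt(len(q[0]))  # head_dim ** -0.5
    output = []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            score = sum(a * b for a, b in zip(q_i, k_j)) * scaling
            # Causal mask: token i may not attend to later tokens j > i.
            scores.append(score if j <= i else float("-inf"))
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([
            sum(w * v_j[d] for w, v_j in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return output


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = causal_attention(q, k, v)
print(out)
```

The first token can only attend to itself, so its output equals its own value vector; the second token mixes both value vectors.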

## Feed Forward + SwiGLU
Until now we have used self-attention to enrich a token's meaning through the words surrounding it and its position in the sequence.

Through a feed-forward network we can apply a learned transformation to each token independently.

One way to think about it: self-attention provides context, while the feed-forward network provides general knowledge.

In [None]:
layer = model.model.layers[0] # Pick a layer (e.g., layer 0)

# Access the MLP (Feed-Forward + SwiGLU)
mlp = layer.mlp

# Check the components
print("Gate projection:", mlp.gate_proj)
print("Up projection:", mlp.up_proj)
print("Down projection:", mlp.down_proj)

The steps involved are:
1. Linear projections: gate projection and up projection (moving the token vector to a higher-dimensional space; 11008 vs 4096)
2. SwiGLU activation (selectively passes useful information through and suppresses irrelevant parts)
3. Finally, a down projection to return to the original embedding space
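
The three steps can be sketched for one token vector in pure Python. This assumes the gated form `down(silu(gate(x)) * up(x))` with SiLU as the gate activation and no bias terms; the tiny weight matrices and helper names are invented for the example (the real layer maps 4096 to 11008 and back).

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply matrix W (rows = output features) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def swiglu_mlp(x, W_gate, W_up, W_down):
    # 1. Project up to the intermediate size with two separate matrices.
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    # 2. Gate: the activated gate values scale the up-projected values.
    hidden = [g * u for g, u in zip(gate, up)]
    # 3. Project back down to the original hidden size.
    return matvec(W_down, hidden)


# Toy weights: hidden size 2, intermediate size 1.
x = [1.0, 0.0]
W_gate = [[1.0, 0.0]]
W_up = [[2.0, 0.0]]
W_down = [[1.0], [1.0]]
out = swiglu_mlp(x, W_gate, W_up, W_down)
print(out)
```

Because the gate is computed from the token itself, the layer can suppress or pass individual intermediate features depending on the input.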