<a href="https://colab.research.google.com/github/mrigakshipandey/seminar-mikroelektronik/blob/main/Taks_3_LLaMa_2_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers accelerate huggingface_hub


## Log in to Hugging Face

In [None]:
from huggingface_hub import login

login()  # Paste your Hugging Face token here (with model access)


## Download and Load the Model

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Llama-2-7b-hf"  # or llama-3 if you have access

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

## Print all layers and their details

| Parameter                              | Description                                                       | Why It Matters                                                           |
| -------------------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------------------ |
| `hidden_size`                          | Dimensionality of the embeddings and hidden layers (e.g. 4096)    | Controls how much information the model can store per token              |
| `intermediate_size`                    | Size of the feedforward layer inside each transformer block       | Affects the expressiveness of the MLP sub-layer                          |
| `num_hidden_layers`                    | Total number of Transformer blocks (e.g. 32 for LLaMA-2 7B)       | More layers add capacity, at the cost of compute and memory              |
| `num_attention_heads`                  | Number of attention heads per layer                               | Allows the model to attend to different subspaces of the input           |
| `num_key_value_heads`                  | Number of key/value heads; fewer KV heads than attention heads means grouped-query attention (GQA), a single KV head means multi-query attention (MQA) | Fewer KV heads shrink the K-V cache and speed up inference (equal to `num_attention_heads` in LLaMA-2 7B, i.e. plain MHA) |
| `vocab_size`                           | Total number of unique tokens the model understands               | Larger vocab supports more languages, rare words                         |
| `max_position_embeddings`              | Maximum context length (number of tokens) the model can attend to | Limits how long the input/output sequences can be                        |
| `rope_theta`                           | Base of the rotary position embedding (RoPE) frequencies          | Affects how attention handles long-range dependencies                    |
| `hidden_dropout_prob`                  | Dropout probability in the hidden layers                          | Regularization to prevent overfitting (not defined in every config, in which case the summary prints N/A) |




In [None]:
config = model.config

print("Model Architecture Summary")
print("-" * 40)
print(f"Model Type        : {config.model_type}")
print(f"Hidden Size       : {config.hidden_size}")
print(f"Intermediate Size : {config.intermediate_size}")
print(f"Num of Layers     : {config.num_hidden_layers}")
print(f"Attention Heads   : {config.num_attention_heads}")
print(f"KV Heads          : {getattr(config, 'num_key_value_heads', 'N/A')}")
print(f"Vocab Size        : {config.vocab_size}")
print(f"Max Context (Seq) : {getattr(config, 'max_position_embeddings', 'N/A')}")
print(f"RoPE Theta        : {getattr(config, 'rope_theta', 'N/A')}")
print(f"Dropout Prob.     : {config.hidden_dropout_prob if hasattr(config, 'hidden_dropout_prob') else 'N/A'}")
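
As a rough cross-check of these values, the parameter count can be estimated from the configuration alone. The sketch below hard-codes the LLaMA-2 7B numbers quoted in this notebook instead of reading `model.config`, and it assumes plain multi-head attention, no bias terms, and an output head that is not tied to the embedding matrix. Treat these as assumptions, not as a statement of how the checkpoint is stored.

```python
# Rough parameter count for a LLaMA-style decoder, from config values only.
# The numbers are the LLaMA-2 7B values quoted in this notebook (assumed).
hidden_size = 4096
intermediate_size = 11008
num_hidden_layers = 32
vocab_size = 32000

embedding = vocab_size * hidden_size        # token embedding matrix
attention = 4 * hidden_size * hidden_size   # q_proj, k_proj, v_proj, o_proj (no biases assumed)
mlp = 3 * hidden_size * intermediate_size   # gate_proj, up_proj, down_proj
norms = 2 * hidden_size                     # two RMSNorm weight vectors per block
per_layer = attention + mlp + norms

final_norm = hidden_size                    # RMSNorm before the output head
lm_head = vocab_size * hidden_size          # assumed untied from the embedding

total_params = embedding + num_hidden_layers * per_layer + final_norm + lm_head
print(f"Per-layer parameters : {per_layer:,}")
print(f"Total parameters     : {total_params:,}")
```

The total comes out at roughly 6.7 billion, which is where the "7B" in the model name comes from.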


## The General Architecture of LLaMa
![llama](https://miro.medium.com/v2/resize:fit:373/1*CQs4ceLpN8tIN8QyezL2Ag.png)





In [None]:
#for name, module in model.named_modules():
    #print(name)


## Input Layer
Let us assume an input tensor of shape (1, 5),

where 1 is the batch size and 5 is the number of tokens in the input sequence.

## Embedding Layer
Each token in the input sequence belongs to a predefined vocabulary.

We use the token's ID (its index in the vocabulary) to look up the corresponding row of the embedding matrix.

The embedding matrix: shape `(vocab_size, hidden_size)`.
- In our case it's (32000, 4096)


In [None]:
embed_layer = model.model.embed_tokens
print("Embedding matrix shape:", embed_layer.weight.shape)
print("Vocab size:", embed_layer.weight.shape[0])
print("Hidden size:", embed_layer.weight.shape[1])

The embedding can be understood as carrying the meaning of the token.

Given our input of shape (1, 5), the embedding layer returns a tensor of shape (1, 5, 4096).
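
The lookup itself is just row indexing. Here is a minimal sketch with toy sizes (a vocabulary of 10 and a hidden size of 4, both chosen only for illustration) and random values in place of learned weights:

```python
import random

random.seed(0)
vocab_size, hidden_size = 10, 4   # toy sizes; LLaMA-2 7B uses 32000 and 4096

# Embedding matrix: one row of length hidden_size per vocabulary entry.
# Random values stand in for the learned weights.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(hidden_size)]
                    for _ in range(vocab_size)]

# A batch of 1 sequence with 5 token IDs, i.e. shape (1, 5).
input_ids = [[3, 7, 1, 7, 9]]

# The lookup replaces each token ID with the matching row of the matrix.
embedded = [[embedding_matrix[token_id] for token_id in sequence]
            for sequence in input_ids]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print("Embedded shape:", shape)
```

Token ID 7 appears twice in the toy input, and both positions receive the same vector: at this stage the representation carries no positional information.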

---
Additional Notes
- In the original Transformer architecture, the embedding is followed by an **absolute positional encoding**.

- In the LLaMA model, **Rotary Positional Encoding** is applied later, to the Query and Key tensors.

### Normalization
LayerNorm both re-centers (subtracts the mean) and re-scales (divides by the standard deviation). RMSNorm, in contrast, rests on the hypothesis that re-scaling, not re-centering, is what makes normalization effective, so it only divides by the root mean square of the activations.

- RMSNorm is popular because it requires fewer computations.

After normalizing by the RMS, the output is scaled element-wise by a learned vector.
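
The computation can be sketched for a single token vector as follows. The function name `rms_norm`, the toy input, and the `eps` value are assumptions made for illustration; the learned vector is set to all ones so that only the normalization is visible.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Divide x by its root mean square, then scale by a learned vector."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)   # eps avoids division by zero
    return [w * v * scale for v, w in zip(x, weight)]

x = [1.0, -2.0, 3.0, -4.0]
weight = [1.0, 1.0, 1.0, 1.0]   # stand-in for the learned scaling vector

y = rms_norm(x, weight)
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print("Normalized vector:", y)
print("RMS after normalization:", rms_after)
```

Unlike LayerNorm, no mean is subtracted, so the signs and relative proportions of the entries are preserved.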

In [None]:
rms = model.model.layers[0].input_layernorm  # Pick any layer, e.g., layer 0
print("Shape:", rms.weight.shape)

### Linear projections to Q, K, V
Multiply the normalized input tensor by the learned matrices $W^Q$, $W^K$ and $W^V$.

In the 7B model each of these matrices has shape (4096, 4096).

After the linear projection, the resulting $Q$, $K$ and $V$ tensors have the same shape as the embedded input tensor, here (1, 5, 4096).

In [None]:
block = model.model.layers[0]  # Pick any layer, e.g., layer 0

q_proj = block.self_attn.q_proj.weight
k_proj = block.self_attn.k_proj.weight
v_proj = block.self_attn.v_proj.weight
o_proj = block.self_attn.o_proj.weight

print("Query Projection (Q)   :", q_proj.shape)
print("Key Projection (K)     :", k_proj.shape)
print("Value Projection (V)   :", v_proj.shape)
print("Output Projection (O)  :", o_proj.shape)

## Multi Head Attention (MHA)

  ![G-MQA](https://www.mdpi.com/pharmaceuticals/pharmaceuticals-17-01300/article_deploy/html/images/pharmaceuticals-17-01300-g0A3.png)


The 7B version uses MHA (as many KV heads as attention heads), while the larger 70B version uses grouped-query attention (GQA).


In [None]:
config = model.config

print(f"Attention Heads   : {config.num_attention_heads}")
print(f"KV Heads          : {getattr(config, 'num_key_value_heads', 'N/A')}")

The query tensor is split equally across the attention heads along the hidden dimension.

$Q$ shape = (1, 5, 4096)

Split across 32 attention heads, each head receives a tensor of shape (1, 5, 128), since 4096 / 32 = 128.

$K$ and $V$ are split in the same way.

We apply self-attention in each head independently and finally concatenate the outputs.
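
The split, per-head attention, and concatenation can be sketched with toy sizes (3 tokens, hidden size 4, 2 heads). The batch dimension is omitted, the function names are illustrative, and the causal mask (each token only attends to itself and earlier tokens) is added here because LLaMA is a decoder-only model. The output projection that follows the concatenation is left out.

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_heads(x, num_heads):
    """Split (seq_len, hidden) into num_heads tensors of shape (seq_len, head_dim)."""
    head_dim = len(x[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x]
            for h in range(num_heads)]

def causal_attention(q, k, v):
    """Scaled dot-product attention for one head with a causal mask."""
    head_dim = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Token i may only attend to positions 0..i.
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(head_dim)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(head_dim)])
    return out

def multi_head_attention(q, k, v, num_heads):
    heads = [causal_attention(qh, kh, vh)
             for qh, kh, vh in zip(split_heads(q, num_heads),
                                   split_heads(k, num_heads),
                                   split_heads(v, num_heads))]
    # Concatenate the per-head outputs back along the hidden dimension.
    return [sum((head[i] for head in heads), []) for i in range(len(q))]

# Toy Q, K, V of shape (3, 4); real values would come from the projections.
q = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
k = [row[:] for row in q]
v = [row[:] for row in q]

output = multi_head_attention(q, k, v, num_heads=2)
print("Output shape:", (len(output), len(output[0])))
```

Because of the causal mask, the first token can only attend to itself, so its output row equals its own value vector.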

![](https://miro.medium.com/v2/resize:fit:893/1*BKsxsnDbIM7eb_dAtws3Yg.png)

## RoPE (Rotary Positional Encoding)

  RoPE is applied to the $Q$ and $K$ tensors (not to $V$).

  ![RoPE](https://miro.medium.com/v2/resize:fit:1400/1*jkoR140vi3LncTrVmvdsDQ.png)

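The cos and sin tensors depend only on the token positions, so they can be precomputed once and reused.
- Each pair of dimensions in $Q$ and $K$ is rotated by an angle proportional to the token's position, which injects positional information into the attention scores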
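
A minimal sketch of the rotation, assuming the textbook formulation that rotates adjacent pairs of dimensions and a base of 10000. The Hugging Face implementation pairs dimension `i` with `i + dim/2` instead, so the numbers differ, but the property shown here is the same: the score between a rotated query and a rotated key depends only on the distance between their positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)        # rotation frequency of this pair
        angle = position * theta
        cos, sin = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * cos - y * sin, x * sin + y * cos]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative distance (3) at two different absolute offsets.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print("Score at positions (5, 2)    :", score_near)
print("Score at positions (105, 102):", score_far)
```

The two scores agree up to floating-point error, and the rotation leaves the length of each vector unchanged.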

In [None]:
# Tokenize input text
text = "Hello LLM User"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
input_ids = inputs["input_ids"]

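# Get the rotary embedding module and generate cos, sin
seq_len = input_ids.shape[1]
rotary_emb = model.model.rotary_emb
position_ids = torch.arange(seq_len, dtype=torch.long, device=input_ids.device).unsqueeze(0)
# The module takes its output dtype and device from `x`, so pass the float
# embeddings rather than the integer token IDs (which would truncate cos/sin)
hidden_states = model.model.embed_tokens(input_ids)
cos, sin = rotary_emb(hidden_states, position_ids)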

print("cos shape:", cos.shape)
print("sin shape:", sin.shape)



## K-V Cache
In Transformer-based autoregressive models (like GPT, LLaMA, etc.), the model generates one token at a time during inference, step by step, from left to right.

At each step:

- The model takes all previously generated tokens as input to predict the next token.

- Recomputing attention from scratch over all previous tokens at every step would be very expensive. The K-V cache stores the key and value tensors of all previous tokens, so each step only computes the query, key and value of the newest token. The token generated at one step becomes the input of the next step, where its query attends over the cached keys and values.
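
The idea can be sketched for a single attention head with toy 2-dimensional vectors. The weight matrices, the `KVCache` class, and the helper names are hypothetical and chosen only for illustration. The cached path projects each token once, while the uncached path re-projects the whole prefix at every step; both produce the same attention output.

```python
import math

def project(x, weight):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def attend(query, keys, values):
    """Attention output of one query over all given keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))]

# Toy projection matrices (hypothetical values).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.5], [-0.5, 0.5]]
W_V = [[2.0, 0.0], [0.0, -1.0]]

def step_without_cache(prefix):
    # Recompute keys and values for every token seen so far.
    keys = [project(x, W_K) for x in prefix]
    values = [project(x, W_V) for x in prefix]
    return attend(project(prefix[-1], W_Q), keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Project only the newest token and append its key/value to the cache.
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        return attend(project(x, W_Q), self.keys, self.values)

token_vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
cache = KVCache()
cached_outputs, full_outputs = [], []
for t, x in enumerate(token_vectors):
    cached_outputs.append(cache.step(x))
    full_outputs.append(step_without_cache(token_vectors[:t + 1]))

print("Cached :", cached_outputs[-1])
print("Full   :", full_outputs[-1])
```

The trade-off is memory: the cache grows by one key and one value per token, per head, and per layer.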



## Feed Forward + SwiGLU
Until now we have used self-attention to enrich a token's meaning through the tokens surrounding it and its position in the sequence.

Through a feed-forward network we can apply a learned transformation to each token independently.

We can think of self-attention as providing context, while the feed-forward network provides common knowledge.

In [None]:
layer = model.model.layers[0] # Pick a layer (e.g., layer 0)

# Access the MLP (Feed-Forward + SwiGLU)
mlp = layer.mlp

# Check the components
print("Gate projection:", mlp.gate_proj)
print("Up projection:", mlp.up_proj)
print("Down projection:", mlp.down_proj)

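The steps involved are:
1. Linear projections: the gate projection and the up projection each map the token vector to a higher-dimensional space (11008 vs 4096)
2. SwiGLU gating: the SiLU (Swish) activation is applied to the gate projection, and the result is multiplied element-wise with the up projection (selectively passing useful information and suppressing irrelevant parts)
3. Finally, a down projection returns to the original embedding space

- The whole block (attention followed by this feed-forward network) is repeated 32 times in a row, once per Transformer layer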
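
The three steps above can be sketched for a single token vector as follows. The sizes (hidden size 2, intermediate size 3) and the weight values are toy assumptions; weights are stored as (out_features, in_features), and biases are omitted.

```python
import math

def silu(x):
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def linear(x, weight):
    """Bias-free linear layer; weight has shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def swiglu_mlp(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in linear(x, w_gate)]    # step 1 (gate) + activation
    up = linear(x, w_up)                           # step 1 (up)
    hidden = [g * u for g, u in zip(gate, up)]     # step 2: element-wise gating
    return linear(hidden, w_down)                  # step 3: back to hidden size

# Toy weights: hidden size 2, intermediate size 3 (hypothetical values).
w_gate = [[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 0.5, -1.0], [0.0, 1.0, 2.0]]

token = [1.0, 2.0]
output = swiglu_mlp(token, w_gate, w_up, w_down)
print("Input size :", len(token))
print("Output size:", len(output))
```

The same weights are applied to every token position independently, which is why the sketch only needs a single vector.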

## Final Linear Layer
We perform a final linear projection. From the output embedding of each position we project a score for every token in the vocabulary, indicating how likely it is to be the next token. These scores are called logits. Applying a softmax to the logits converts them into probabilities.
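
A minimal sketch of the last step, using four made-up logits in place of the 32000 scores the real model produces. Picking the highest-probability token (greedy decoding) is shown only as one simple way to use the probabilities; it is an assumption of this example, not something the notebook prescribes.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1, -1.0]            # toy scores for a 4-token vocabulary
probs = softmax(logits)

# Greedy choice: the token ID with the highest probability.
next_token = max(range(len(probs)), key=probs.__getitem__)

print("Probabilities:", [round(p, 3) for p in probs])
print("Greedy next token ID:", next_token)
```

Softmax preserves the ordering of the logits, so the greedy choice is the same whether it is made on logits or on probabilities.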

In [None]:
print("LM head weight shape:", model.lm_head.weight.shape)