<a href="https://colab.research.google.com/github/peremartra/Rearchitecting-LLMs/blob/main/CH03/CH03_NB01_Model_structures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rearchitecting LLMs
## Surgical Optimization for Hyper-Efficient Models


### Chapter 3: The Internal Structure of Modern Transformers
by [Pere Martra](https://github.com/peremartra)
_____
Colab Environment: GPU T4

Models:
* distilgpt2
* meta-llama/Llama-3.2-1B
* google/gemma-3-270m
* microsoft/Phi-4-mini-instruct
* Qwen/Qwen3-0.6B

_____

This notebook prints and compares the internal structure of several transformer models.

In [1]:
!pip install -q \
    "transformers==4.51.3" \
    "accelerate==1.3.0" \
    "bitsandbytes"

import torch
import gc
#import psutil
#import os
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, AutoConfig
#from transformers.models.auto.modeling_auto import MODEL_MAPPING
import warnings
warnings.filterwarnings("ignore")

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.4/10.4 MB[0m [31m66.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m336.6/336.6 kB[0m [31m25.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.3/61.3 MB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m35.0 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
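def clean_memory(model):
    """Free the memory held by a model."""
    # Note: `del` only drops this function's local reference; the caller's
    # variable keeps the object alive until it is reassigned or deleted.
    del model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()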
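
A small sketch of how `del` behaves inside a helper like this one. `DummyModel` is a hypothetical stand-in for a loaded model, and the CUDA call is left out so the snippet runs anywhere. It suggests that the object is only released once the caller's own name is deleted or rebound, which is what happens in this notebook each time `model` is reassigned by the next `load_model` call.

```python
import gc
import weakref


class DummyModel:
    """Stand-in for a loaded model (hypothetical, for illustration only)."""


def clean_memory(model):
    # Same shape as the notebook helper, minus the CUDA call.
    del model          # removes only this function's local name
    gc.collect()


model = DummyModel()
probe = weakref.ref(model)     # lets us observe whether the object is alive

clean_memory(model)
alive_after_helper = probe() is not None   # caller's `model` still holds it

del model                      # drop the caller's reference too
gc.collect()
alive_after_del = probe() is not None

print(alive_after_helper, alive_after_del)
```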

In [3]:
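def load_model(model_name):
    """Load a causal LM on CPU, keeping the dtype stored in the checkpoint."""
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",      # use the checkpoint's dtype
        device_map="cpu",
        trust_remote_code=True,  # allow repositories to run their own modeling code
    )

    return model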

# Chapter architectures

## DistilGPT2

In [4]:
model = load_model("DistilGPT2")


In [5]:
print(model)
clean_memory(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


**DistilGPT2 - Key Insights Mapped to Model Structure:**

• **Classical Stacked Architecture**: 6 identical GPT2Block layers
  → **Line**: `(0-5): 6 x GPT2Block(`

• **Traditional MLP Design**: Two projections with a GELU activation in between (768→3072→768), giving a 4x expansion ratio
  → **Lines**:
  ```
  (c_fc): Conv1D(nf=3072, nx=768)
  (c_proj): Conv1D(nf=768, nx=3072)
  ```

• **Multi-Head Attention**: Classic attention implemented with Conv1D layers; a single c_attn projects to 2304 = 3×768 dimensions, which are then split into Q, K, and V
  → **Line**: `(c_attn): Conv1D(nf=2304, nx=768)`

• **Pre-Norm Architecture**: LayerNorm applied before attention (ln_1) and MLP (ln_2), not after
  → **Lines**:
  ```
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  ```

• **Compact Embedding Space**: 768-dimensional hidden states with 50,257 vocabulary tokens
  → **Line**: `(wte): Embedding(50257, 768)`

• **Moderate Parameter Count**: ~82M parameters
  → **Pattern**: Token embeddings (50257×768 ≈ 38.6M) plus position embeddings (1024×768 ≈ 0.8M) and six blocks of about 7.1M parameters each

• **Classical Width Pruning Target**: The MLP's c_fc layer (768→3072) is the traditional target for width pruning: dropping one of its 3072 output neurons also drops the matching input of the MLP's c_proj
  → **Line**: `(c_fc): Conv1D(nf=3072, nx=768)`
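
The ~82M figure can be reproduced from the printed shapes alone. The sketch below is plain arithmetic with two labeled assumptions: each `Conv1D` carries a bias vector (the printout does not show it), and `lm_head` shares its weights with `wte`, so it is not counted a second time.

```python
# Shapes copied from the printed DistilGPT2 structure above.
vocab, n_pos, d_model, d_ff, n_layers = 50257, 1024, 768, 3072, 6


def conv1d(nx, nf):
    """Assumed Conv1D size: an (nx, nf) weight plus an nf-element bias."""
    return nx * nf + nf


def layer_norm(d):
    """Scale and shift vectors (elementwise_affine=True)."""
    return 2 * d


embeddings = vocab * d_model + n_pos * d_model              # wte + wpe
attention = conv1d(d_model, 3 * d_model) + conv1d(d_model, d_model)
mlp = conv1d(d_model, d_ff) + conv1d(d_ff, d_model)
block = layer_norm(d_model) + attention + layer_norm(d_model) + mlp

# Assumption: lm_head reuses the wte matrix, so it adds nothing here.
total = embeddings + n_layers * block + layer_norm(d_model)  # + ln_f

print(f"embeddings: {embeddings:,}")
print(f"per block:  {block:,} (MLP share {mlp / block:.0%})")
print(f"total:      {total:,}")
```

Under these assumptions the total comes to 81,912,576, and the MLP accounts for roughly two thirds of each block, which is why it is the usual first target for width pruning.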

## meta-llama/Llama-3.2-1B

In [6]:
model = load_model("meta-llama/Llama-3.2-1B")


In [7]:
print(model)
clean_memory(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb):

**Llama-3.2-1B - Key Insights Mapped to Model Structure:**

• **Modern GLU Architecture**: The MLP uses a **gating mechanism** (gate_proj + up_proj → down_proj)
  → **Lines**:
  ```
  (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
  (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
  (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
  ```

• **Grouped Query Attention (GQA)**: Q projections use the full 2048 dimensions while K/V use only 512, so groups of query heads share each K/V head and the K/V projections are 4x narrower
  → **Lines**:
  ```
  (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
  (k_proj): Linear(in_features=2048, out_features=512, bias=False)
  (v_proj): Linear(in_features=2048, out_features=512, bias=False)
  ```

• **Same Expansion Ratio, More Matrices**: MLP expands from 2048→8192, the same 4x ratio as DistilGPT2, but the GLU structure needs three projection matrices instead of two
  → **Lines**:
  ```
  (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
  (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
  ```

• **Bias-Free Design**: No bias terms in any Linear layer, so every projection is a pure matrix multiplication
  → **Pattern**: `bias=False` appears in all Linear layers throughout the model

• **RMSNorm Normalization**: Uses RMS normalization instead of LayerNorm
  → **Lines**:
  ```
  (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
  (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
  (norm): LlamaRMSNorm((2048,), eps=1e-05)
  ```

• **Deeper Architecture**: 16 layers vs DistilGPT2's 6 - more **Depth Pruning opportunities**
  → **Line**: `(0-15): 16 x LlamaDecoderLayer(`

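• **Rotary Position Embeddings**: Position is encoded by rotating Q and K inside attention instead of adding a learned position table like DistilGPT2's wpe, which tends to handle longer sequences better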
  → **Line**: `(rotary_emb): LlamaRotaryEmbedding()`

• **SiLU Activation**: Smooth activation applied to the gate_proj output before it multiplies the up_proj output, so it determines how strongly each neuron is gated
  → **Line**: `(act_fn): SiLU()`

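• **Perfect Width Pruning Candidate**: Each intermediate neuron spans a row of gate_proj, a row of up_proj, and a column of down_proj, so all three must be pruned together; this dual-path structure makes the architecture ideal for demonstrating advanced neuron selection strategies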
  → **Lines**:
  ```
  (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
  (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
  ```
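
To make the gating and its consequence for width pruning concrete, here is a minimal pure-Python sketch of a bias-free GLU MLP, `down_proj(SiLU(gate_proj(x)) * up_proj(x))`, with toy sizes. The helper names and weight values are hypothetical and chosen for illustration; the neuron that gets removed has a zero `up_proj` row, so it contributes nothing, which is a deliberately trivial selection rule rather than any particular strategy from the chapter.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    """W is a list of rows (out_features x in_features); returns W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def glu_mlp(x, gate_w, up_w, down_w):
    """down_proj(SiLU(gate_proj(x)) * up_proj(x)), without biases."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(down_w, hidden)


def prune_neurons(gate_w, up_w, down_w, keep):
    """Keep only the intermediate neurons whose indices are in `keep`."""
    gate_p = [gate_w[i] for i in keep]                    # rows of gate_proj
    up_p = [up_w[i] for i in keep]                        # rows of up_proj
    down_p = [[row[i] for i in keep] for row in down_w]   # columns of down_proj
    return gate_p, up_p, down_p


# Toy sizes: hidden=2, intermediate=3 (Llama-3.2-1B uses 2048 and 8192).
gate_w = [[0.5, -1.0], [1.0, 2.0], [0.3, 0.3]]
up_w = [[1.0, 0.0], [0.0, 0.0], [-0.5, 1.5]]   # neuron 1 has a zero up row
down_w = [[1.0, 2.0, -1.0], [0.5, -3.0, 2.0]]

x = [0.7, -0.2]
full = glu_mlp(x, gate_w, up_w, down_w)
gate_p, up_p, down_p = prune_neurons(gate_w, up_w, down_w, keep=[0, 2])
pruned = glu_mlp(x, gate_p, up_p, down_p)

print(full)
print(pruned)
```

Because neuron 1 is silent, the pruned MLP returns the same output as the full one. With real weights no neuron is exactly silent, so the selection criterion decides how much the output shifts.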

## google/gemma-3-270m

In [8]:
model = load_model("google/gemma-3-270m")


In [9]:
print(model)
clean_memory(model)

Gemma3ForCausalLM(
  (model): Gemma3TextModel(
    (embed_tokens): Gemma3TextScaledWordEmbedding(262144, 640, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x Gemma3DecoderLayer(
        (self_attn): Gemma3Attention(
          (q_proj): Linear(in_features=640, out_features=1024, bias=False)
          (k_proj): Linear(in_features=640, out_features=256, bias=False)
          (v_proj): Linear(in_features=640, out_features=256, bias=False)
          (o_proj): Linear(in_features=1024, out_features=640, bias=False)
          (q_norm): Gemma3RMSNorm((256,), eps=1e-06)
          (k_norm): Gemma3RMSNorm((256,), eps=1e-06)
        )
        (mlp): Gemma3MLP(
          (gate_proj): Linear(in_features=640, out_features=2048, bias=False)
          (up_proj): Linear(in_features=640, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=640, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma3RMSNorm((640,), eps

**Gemma-3-270M - Key Insights Mapped to Model Structure:**

• **Compact GLU Architecture**: Modern GLU structure (gate_proj + up_proj → down_proj) like Llama but in a much smaller footprint
  → **Lines**:
  ```
  (gate_proj): Linear(in_features=640, out_features=2048, bias=False)
  (up_proj): Linear(in_features=640, out_features=2048, bias=False)
  (down_proj): Linear(in_features=2048, out_features=640, bias=False)
  ```

• **Grouped Query Attention with Normalization**: GQA structure (Q: 640→1024, K/V: 640→256) plus **q_norm and k_norm** layers, not present in Llama, that normalize queries and keys over 256 dimensions
  → **Lines**:
  ```
  (q_proj): Linear(in_features=640, out_features=1024, bias=False)
  (k_proj): Linear(in_features=640, out_features=256, bias=False)
  (v_proj): Linear(in_features=640, out_features=256, bias=False)
  (q_norm): Gemma3RMSNorm((256,), eps=1e-06)
  (k_norm): Gemma3RMSNorm((256,), eps=1e-06)
  ```

• **Ultra-Lightweight Design**: Only 640 hidden dimensions and 270M parameters, of which the embedding table alone accounts for 262144×640 ≈ 168M
  → **Pattern**: `in_features=640` appears throughout + `Gemma3TextScaledWordEmbedding(262144, 640, padding_idx=0)`

• **Conservative Expansion Ratio**: 3.2x MLP expansion (640→2048) - more modest than Llama's 4x
  → **Lines**:
  ```
  (gate_proj): Linear(in_features=640, out_features=2048, bias=False)
  (up_proj): Linear(in_features=640, out_features=2048, bias=False)
  ```

• **Heavy Normalization**: Four RMSNorm layers per decoder block (before and after both attention and MLP) - reflects modern architectures' emphasis on training stability
  → **Lines**:
  ```
  (input_layernorm): Gemma3RMSNorm((640,), eps=1e-06)
  (post_attention_layernorm): Gemma3RMSNorm((640,), eps=1e-06)
  (pre_feedforward_layernorm): Gemma3RMSNorm((640,), eps=1e-06)
  (post_feedforward_layernorm): Gemma3RMSNorm((640,), eps=1e-06)
  ```

• **Dual Rotary Embeddings**: Separate rotary embeddings for global attention layers and for "local" (sliding-window) attention layers
  → **Lines**:
  ```
  (rotary_emb): Gemma3RotaryEmbedding()
  (rotary_emb_local): Gemma3RotaryEmbedding()
  ```

• **GELU-Tanh Activation**: Uses PytorchGELUTanh instead of SiLU - shows activation function diversity
  → **Line**: `(act_fn): PytorchGELUTanh()`

• **Scaled Word Embeddings**: `Gemma3TextScaledWordEmbedding` multiplies the looked-up embeddings by a constant scale factor before they enter the first layer
  → **Line**: `(embed_tokens): Gemma3TextScaledWordEmbedding(262144, 640, padding_idx=0)`

• **Perfect Educational Model**: Small enough for laptop experimentation but sophisticated enough to demonstrate all modern techniques - ideal for **Depth Pruning** (18 layers) and **Width Pruning** examples
  → **Line**: `(0-17): 18 x Gemma3DecoderLayer(` + compact dimensions throughout
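
The attention shapes above are enough to work out the head layout. The sketch below assumes that `q_norm` and `k_norm` are applied per head, so their size (256) equals the head dimension; the "MHA" figure is a hypothetical baseline in which K and V would be as wide as Q.

```python
# Shapes copied from the printed Gemma-3-270M structure above.
q_out, kv_out = 1024, 256     # out_features of q_proj and of k_proj / v_proj
head_dim = 256                # assumption: size of q_norm / k_norm = head size
n_layers = 18

n_q_heads = q_out // head_dim
n_kv_heads = kv_out // head_dim
group_size = n_q_heads // n_kv_heads   # query heads sharing one K/V head

# Values cached per token: one K and one V vector per layer.
kv_per_token_gqa = 2 * kv_out * n_layers
kv_per_token_mha = 2 * q_out * n_layers   # hypothetical: K/V as wide as Q

print(n_q_heads, n_kv_heads, group_size)
print(kv_per_token_gqa, kv_per_token_mha)
```

Under that assumption the model has four query heads sharing a single K/V head (the limiting case of GQA, often called multi-query attention), and the per-token KV cache is a quarter of what equal-width K/V projections would need.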


# Exploring Other Notable Architectures

## microsoft/Phi-4-mini-instruct

In [10]:
model = load_model("microsoft/Phi-4-mini-instruct")


In [11]:
print(model)
clean_memory(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(200064, 3072, padding_idx=199999)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=5120, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_features=20006

**Phi-4-mini-instruct - Key Insights Mapped to Model Structure:**

• **Deep Architecture**: 32 layers, demonstrates **extreme Depth Pruning potential** but requires sophisticated layer selection strategies
  → **Line**: `(0-31): 32 x Phi3DecoderLayer(`

• **Fused GLU Implementation**: `gate_up_proj` combines the gate and up projections in a single matrix (3072→16384 = 2×8192), then splits the result internally
  → **Line**: `(gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)`

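• **Fused QKV Projection**: A single `qkv_proj` (3072→5120) produces Q, K, and V at once. Since 5120 = 3072 + 2×1024, K and V are narrower than Q, which points to grouped query attention rather than classic multi-head attention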
  → **Line**: `(qkv_proj): Linear(in_features=3072, out_features=5120, bias=False)`

• **Asymmetric Down Projection**: `down_proj` expects 8192 inputs but `gate_up_proj` outputs 16384 - confirms the internal splitting of the fused projection
  → **Lines**:
  ```
  (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
  (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
  ```

• **Residual Dropout Layers**: Explicit dropout modules on the attention and MLP residual paths, which do not appear in Llama's structure
  → **Lines**:
  ```
  (resid_attn_dropout): Dropout(p=0.0, inplace=False)
  (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
  ```

• **Large Vocabulary**: 200K+ tokens vs Llama's 128K - shows how modern models handle larger tokenization schemes
  → **Line**: `(embed_tokens): Embedding(200064, 3072, padding_idx=199999)`

• **Zero Dropout Probability**: Both dropout probabilities are set to 0.0, so these layers are no-ops in this checkpoint (and dropout is disabled in eval mode regardless)
  → **Pattern**: `p=0.0` in both `(resid_attn_dropout)` and `(resid_mlp_dropout)` lines
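
The fused projections change the bookkeeping for pruning, because one matrix now holds several logical projections. The sketch below derives the split points from the printed shapes. It rests on two assumptions that the printout cannot confirm: Q comes first in `qkv_proj` and keeps the full hidden width, and the `gate_up_proj` output is chunked in half with the gate part first. `fused_rows_for_neuron` is a hypothetical helper.

```python
# Shapes copied from the printed Phi-4-mini-instruct structure above.
hidden = 3072
qkv_out, gate_up_out, down_in = 5120, 16384, 8192

# Assumption: Q keeps the full hidden width; K and V split the rest equally.
q_size = hidden
kv_size = (qkv_out - q_size) // 2
qkv_slices = {
    "q": (0, q_size),
    "k": (q_size, q_size + kv_size),
    "v": (q_size + kv_size, qkv_out),
}

# Assumption: the fused output is chunked in half, gate first, then up.
intermediate = gate_up_out // 2


def fused_rows_for_neuron(i):
    """Rows of gate_up_proj and the column of down_proj tied to neuron i."""
    return {"gate_row": i, "up_row": intermediate + i, "down_col": i}


print(qkv_slices)
print(intermediate == down_in)
print(fused_rows_for_neuron(10))
```

If those assumptions hold, removing intermediate neuron `i` means deleting two rows of `gate_up_proj` that sit 8192 apart, plus one column of `down_proj`.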


## Llama-3.1-8B-Instruct (quantized)

In [12]:
from transformers import BitsAndBytesConfig

In [13]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

In [14]:
model_instruct_quantized = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float16,
)


In [15]:
print(model_instruct_quantized)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((409
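
The printout shows that only the attention and MLP projections became `Linear4bit`, while `embed_tokens` stays a regular `Embedding`. A rough estimate of what that saves, using the printed shapes only: the figures below ignore the quantization constants that NF4 stores alongside the weights, and they leave out `lm_head`, which is cut off in the output above, so treat them as a lower bound rather than a measurement.

```python
# Shapes copied from the printed Llama-3.1-8B-Instruct structure above.
hidden, kv_dim, intermediate, n_layers = 4096, 1024, 14336, 32
vocab = 128256

attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * hidden * intermediate                    # gate_proj, up_proj, down_proj
linear4bit_params = n_layers * (attn + mlp)
embedding_params = vocab * hidden                  # not quantized

GIB = 1024 ** 3
fp16_gib = linear4bit_params * 2 / GIB     # 2 bytes per weight
nf4_gib = linear4bit_params * 0.5 / GIB    # 4 bits per weight, constants ignored
embedding_gib = embedding_params * 2 / GIB

print(f"quantized layers: {linear4bit_params:,} parameters")
print(f"fp16: {fp16_gib:.2f} GiB, 4-bit: {nf4_gib:.2f} GiB")
print(f"embeddings in fp16: {embedding_gib:.2f} GiB")
```

By this estimate the quantized layers drop from about 13 GiB in float16 to about 3.25 GiB, which helps explain why the 8B model is loaded in 4-bit on the T4 used for this notebook.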