<a href="https://colab.research.google.com/github/peremartra/Rearchitecting-LLMs/blob/main/CH03/CH03_NB01_Model_structures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rearchitecting LLMs
## Surgical Optimization for Hyper-Efficient Models


### Chapter 3: The Internal Structure of Modern Transformers
by [Pere Martra](https://github.com/peremartra)
_____
Colab Environment: GPU T4

Models:
* distilgpt2
* meta-llama/Llama-3.2-1B
* google/gemma-3-270m
* microsoft/Phi-4-mini-instruct
* Qwen/Qwen3-0.6B

_____

In this notebook you'll find the structure of different transformer models.

In [None]:
# Analysis of Language Model Structures
# Notebook optimized for Google Colab with efficient memory management

# Install the required dependencies
#!pip install transformers torch accelerate sentencepiece

!pip install -q \
      "torch==2.8.0+cu126" \
      "transformers==4.55.4" \
      "accelerate==1.10.1"

import torch
import gc
#import psutil
#import os
from transformers import AutoModel, AutoTokenizer, AutoConfig
#from transformers.models.auto.modeling_auto import MODEL_MAPPING
import warnings
warnings.filterwarnings("ignore")

In [None]:
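def clean_memory(model):
    """Release memory held by a model.

    Note: `del model` removes only this function's local reference. The
    caller's variable keeps the model alive until it is deleted or rebound.
    """
    del model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()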
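
A caveat about the helper above, sketched with the standard library only: `del` inside a function drops just the local name, so the object survives while the caller still references it. `DummyModel` is a hypothetical stand-in for a loaded model, and the simplified `clean_memory` omits the CUDA call so the sketch runs anywhere.

```python
import gc
import weakref


class DummyModel:
    """Hypothetical stand-in for a loaded model (not a real transformers class)."""


def clean_memory(model):
    del model        # removes only the local name inside this function
    gc.collect()


model = DummyModel()
ref = weakref.ref(model)   # lets us observe whether the object still exists

clean_memory(model)
still_alive_after_helper = ref() is not None   # the caller's name still holds it

del model                  # drop the caller's reference as well
gc.collect()
freed_after_caller_del = ref() is None

print("alive after clean_memory:", still_alive_after_helper)
print("freed after caller del:  ", freed_after_caller_del)
```

In this notebook the previous model is released when `model` is rebound by the next `load_model` call, so the cells still work as written; an explicit `del model` in the calling cell frees it earlier.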

In [None]:
def load_model(model_name):
    # Load the model
    model = AutoModel.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="cpu",  # Load on CPU to save GPU memory
        trust_remote_code=True
    )

    return model

# DistilGPT2

In [None]:
# Load the model
model = load_model("DistilGPT2")

In [None]:
print(model)
clean_memory(model)

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-5): 6 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D(nf=2304, nx=768)
        (c_proj): Conv1D(nf=768, nx=768)
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D(nf=3072, nx=768)
        (c_proj): Conv1D(nf=768, nx=3072)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)


## DistilGPT2 - Key Characteristics:

• **Classical Stacked Architecture**: 6 identical GPT2Block layers - perfect example for **Depth Pruning** opportunities (removing entire blocks)

• **Traditional MLP Design**: Simple linear transformation (768→3072→768) with 4x expansion ratio - no gating mechanism like modern GLU architectures

• **Multi-Head Attention**: Classic attention implementation using Conv1D layers (c_attn is a single fused projection from 768 to 2304 = 3 × 768 dimensions, split into Q, K, V)

• **Pre-Norm Architecture**: LayerNorm applied before attention (ln_1) and MLP (ln_2), not after - each sub-layer reads a normalized input while its output is added to the residual stream

• **Compact Embedding Space**: 768-dimensional hidden states with 50,257 vocabulary tokens

• **Moderate Parameter Count**: ~82M parameters - small enough for experimentation but complex enough to demonstrate techniques

• **Classical Width Pruning Target**: The MLP's c_fc layer (768→3072) represents the traditional approach to width pruning, though less effective than modern GLU variants
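
The parameter count quoted above can be reproduced from the printed shapes alone. This is a minimal sketch with no model download; the helper names `conv1d` and `layer_norm` are illustrative and not part of the notebook.

```python
# Shapes copied from the printed GPT2Model above.
vocab, positions, hidden, inner, n_blocks = 50257, 1024, 768, 3072, 6


def conv1d(nx, nf):
    """Parameters in a GPT-2 Conv1D layer: an nx-by-nf weight plus nf biases."""
    return nx * nf + nf


def layer_norm(dim):
    """LayerNorm with elementwise_affine=True: one scale and one shift per feature."""
    return 2 * dim


assert 3 * hidden == 2304  # c_attn packs Q, K and V into one projection

embeddings = vocab * hidden + positions * hidden                 # wte + wpe
attention = conv1d(hidden, 3 * hidden) + conv1d(hidden, hidden)  # c_attn + c_proj
mlp = conv1d(hidden, inner) + conv1d(inner, hidden)              # c_fc + c_proj
block = layer_norm(hidden) + attention + layer_norm(hidden) + mlp
total = embeddings + n_blocks * block + layer_norm(hidden)       # final term is ln_f

print(f"per block: {block:,}")
print(f"total:     {total:,}")
print(f"MLP share of a block: {mlp / block:.1%}")
```

The total comes to 81,912,576, the "~82M" above. About two thirds of each block sits in the MLP, which is why the MLP is the usual target for width pruning.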

# meta-llama/Llama-3.2-1B

In [None]:
model = load_model("meta-llama/Llama-3.2-1B")

In [None]:
print(model)
clean_memory(model)

LlamaModel(
  (embed_tokens): Embedding(128256, 2048)
  (layers): ModuleList(
    (0-15): 16 x LlamaDecoderLayer(
      (self_attn): LlamaAttention(
        (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
        (k_proj): Linear(in_features=2048, out_features=512, bias=False)
        (v_proj): Linear(in_features=2048, out_features=512, bias=False)
        (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
      )
      (mlp): LlamaMLP(
        (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
        (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
        (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
    )
  )
  (norm): LlamaRMSNorm((2048,), eps=1e-05)
  (rotary_emb): LlamaRotaryEmbedding()
)


## Llama-3.2-1B - Key Characteristics:

• **Modern GLU Architecture**: MLP uses **gating mechanism** (gate_proj + up_proj → down_proj) - this is where **Width Pruning becomes extremely powerful** and will be our focus in Chapter 5

• **Grouped Query Attention (GQA)**: Q projections use the full 2048 dimensions while K/V use only 512, so several query heads share each key/value head - smaller K/V projections than Multi-Head Attention, at the cost of added implementation complexity

• **Same 4x Expansion Ratio, More Weights**: MLP expands from 2048→8192 (4x, the same ratio as DistilGPT2), but the GLU structure uses two expanding projections (gate and up) instead of one, creating massive pruning opportunities in the gate and up projections

• **Bias-Free Design**: No bias parameters throughout - cleaner architecture and fewer parameters to manage during pruning

• **RMSNorm Normalization**: Uses RMS normalization instead of LayerNorm - it rescales by the root mean square without subtracting the mean, so it is cheaper to compute

• **Deeper Architecture**: 16 layers vs DistilGPT2's 6 - more **Depth Pruning opportunities** but requires careful layer selection

• **Rotary Position Embeddings**: Position is encoded inside attention (rotary_emb) rather than through a learned position table like DistilGPT2's wpe - affects how attention patterns form

• **SiLU Activation**: Smooth activation function applied to the gate_proj output before it multiplies the up_proj output - it determines how strongly each neuron's gate opens in the GLU structure

• **Perfect Width Pruning Candidate**: The dual-path MLP structure (gate_proj + up_proj) makes this architecture ideal for demonstrating advanced neuron selection strategies
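
To make the dual-path structure concrete, here is a toy GLU MLP in pure Python with made-up weights (hidden size 2, intermediate size 3). It is only a sketch of the idea, not the pruning method developed later in the book: `glu_mlp` and `prune_neurons` are hypothetical helpers, and the weight layout follows the `[out_features][in_features]` convention of the Linear layers printed above.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def silu(x):
    return x / (1.0 + math.exp(-x))


def neuron_activations(x, gate, up):
    """One value per intermediate neuron: silu(gate_i . x) * (up_i . x)."""
    return [silu(dot(g, x)) * dot(u, x) for g, u in zip(gate, up)]


def glu_mlp(x, gate, up, down):
    """gate, up: [intermediate][hidden]; down: [hidden][intermediate]."""
    acts = neuron_activations(x, gate, up)
    return [dot(row, acts) for row in down]


def prune_neurons(gate, up, down, keep):
    """Keep the listed neurons: rows of gate/up and the matching columns of down."""
    gate_p = [gate[i] for i in keep]
    up_p = [up[i] for i in keep]
    down_p = [[row[i] for i in keep] for row in down]
    return gate_p, up_p, down_p


# Toy weights, invented for illustration only.
gate = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
down = [[1.0, -1.0, 0.5], [0.2, 0.3, -0.4]]
x = [1.0, 2.0]

full = glu_mlp(x, gate, up, down)
gate_p, up_p, down_p = prune_neurons(gate, up, down, keep=[0, 2])  # drop neuron 1
pruned = glu_mlp(x, gate_p, up_p, down_p)

# The outputs differ by exactly the contribution of the removed neuron.
act_1 = neuron_activations(x, gate, up)[1]
expected = [full[j] - down[j][1] * act_1 for j in range(len(full))]

print("full:  ", full)
print("pruned:", pruned)
```

Removing one neuron touches all three matrices at once: a row of gate_proj, the same row of up_proj, and the matching column of down_proj. In Llama-3.2-1B that is 3 × 2048 weights per neuron per layer.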

# google/gemma-3-270m

In [None]:
model = load_model("google/gemma-3-270m")

In [None]:
print(model)
clean_memory(model)

Gemma3TextModel(
  (embed_tokens): Gemma3TextScaledWordEmbedding(262144, 640, padding_idx=0)
  (layers): ModuleList(
    (0-17): 18 x Gemma3DecoderLayer(
      (self_attn): Gemma3Attention(
        (q_proj): Linear(in_features=640, out_features=1024, bias=False)
        (k_proj): Linear(in_features=640, out_features=256, bias=False)
        (v_proj): Linear(in_features=640, out_features=256, bias=False)
        (o_proj): Linear(in_features=1024, out_features=640, bias=False)
        (q_norm): Gemma3RMSNorm((256,), eps=1e-06)
        (k_norm): Gemma3RMSNorm((256,), eps=1e-06)
      )
      (mlp): Gemma3MLP(
        (gate_proj): Linear(in_features=640, out_features=2048, bias=False)
        (up_proj): Linear(in_features=640, out_features=2048, bias=False)
        (down_proj): Linear(in_features=2048, out_features=640, bias=False)
        (act_fn): PytorchGELUTanh()
      )
      (input_layernorm): Gemma3RMSNorm((640,), eps=1e-06)
      (post_attention_layernorm): Gemma3RMSNorm((640,), ep

## Gemma-3-270m - Key Characteristics:

• **Compact GLU Architecture**: Modern GLU structure (gate_proj + up_proj → down_proj) like Llama but in a much smaller footprint - ideal for **Width Pruning experimentation** without heavy computational costs

• **Grouped Query Attention with Normalization**: GQA structure (Q: 640→1024, K/V: 640→256) plus **q_norm and k_norm** layers that normalize queries and keys over the 256-dimensional head - shows attention evolution beyond basic GQA

• **Ultra-Lightweight Design**: Only 640 hidden dimensions and 270M parameters - perfect for rapid prototyping and educational demonstrations

• **Conservative Expansion Ratio**: 3.2x MLP expansion (640→2048) - more modest than Llama's 4x, showing how different architectures balance efficiency vs capacity

• **Extensive Normalization**: Four RMSNorm layers per decoder block (input, post-attention, pre-feedforward, post-feedforward) - demonstrates modern architecture's emphasis on training stability

• **Dual Rotary Embeddings**: Both standard and "local" rotary embeddings - advanced positional encoding for better sequence understanding

• **GELU-Tanh Activation**: Uses PytorchGELUTanh instead of SiLU - shows activation function diversity in modern architectures

• **Scaled Word Embeddings**: `Gemma3TextScaledWordEmbedding` indicates specialized embedding handling - the looked-up token embeddings are multiplied by a constant scale factor before entering the first layer

• **Perfect Educational Model**: Small enough for laptop experimentation but sophisticated enough to demonstrate all modern techniques - ideal for **Depth Pruning** (18 layers) and **Width Pruning** examples
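
The attention layout and expansion ratios discussed above follow from the printed shapes with a little arithmetic. In this sketch `attention_layout` is a hypothetical helper; Gemma's head size of 256 is read from the q_norm/k_norm shapes in the printout, while Llama's head size of 64 is an assumption taken from the model configuration, since the printout does not show it.

```python
def attention_layout(q_out, kv_out, head_dim):
    """Return (query heads, key/value heads, query heads per key/value head)."""
    q_heads = q_out // head_dim
    kv_heads = kv_out // head_dim
    return q_heads, kv_heads, q_heads // kv_heads


# Gemma-3-270m: head_dim = 256, from Gemma3RMSNorm((256,)) on q_norm / k_norm.
gemma = attention_layout(q_out=1024, kv_out=256, head_dim=256)

# Llama-3.2-1B: head_dim = 64 is an ASSUMPTION (config value, not in the printout).
llama = attention_layout(q_out=2048, kv_out=512, head_dim=64)

# MLP expansion ratio = intermediate size / hidden size.
gemma_ratio = 2048 / 640
llama_ratio = 8192 / 2048

print("Gemma (q heads, kv heads, group size):", gemma)
print("Llama (q heads, kv heads, group size):", llama)
print(f"MLP expansion: Gemma {gemma_ratio:.1f}x, Llama {llama_ratio:.1f}x")
```

Under these head sizes, Gemma-3-270m has four query heads sharing a single key/value head, while Llama-3.2-1B has 32 query heads in groups of four.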


# microsoft/Phi-4-mini-instruct

In [None]:
model = load_model("microsoft/Phi-4-mini-instruct")

In [None]:
print(model)
clean_memory(model)

Phi3Model(
  (embed_tokens): Embedding(200064, 3072, padding_idx=199999)
  (layers): ModuleList(
    (0-31): 32 x Phi3DecoderLayer(
      (self_attn): Phi3Attention(
        (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
        (qkv_proj): Linear(in_features=3072, out_features=5120, bias=False)
      )
      (mlp): Phi3MLP(
        (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
        (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
        (activation_fn): SiLU()
      )
      (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
      (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
    )
  )
  (norm): Phi3RMSNorm((3072,), eps=1e-05)
  (rotary_emb): Phi3RotaryEmbedding()
)
