# Accounting for the transformer's operations

Problem (transformer_accounting): Transformer LM resource accounting (5 points)
(a) Consider GPT-2 XL, which has the following configuration:

vocab_size : 50,257

context_length : 1,024

num_layers : 48

d_model : 1,600

num_heads : 25

d_ff : 6,400

Suppose we constructed our model using this configuration. How many trainable parameters
would our model have? Assuming each parameter is represented using single-precision floating
point, how much memory is required to just load this model?
Deliverable: A one-to-two sentence response.

In [8]:
vocab_size = 50257
context_length = 1024
d_model = 1600
num_heads = 25
num_layers = 48
d_ff = 6400


In [16]:
# 1. Token Embeddings
# The token embeddings layer maps input tokens to vectors of size d_model.

token_embedding_parameters = vocab_size * d_model
token_embedding_parameters


80411200

In [17]:
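# 2. Transformer blocks
# Counts bias-free linear layers, a SwiGLU feed-forward network (three weight
# matrices), and two norm layers per block with one gain vector of size d_model each.

# Attention projections
W_q = d_model * d_model
W_k = d_model * d_model
W_v = d_model * d_model
W_o = d_model * d_model

attention = W_q + W_k + W_v + W_o

# Feed-forward (SwiGLU)
W1 = d_model * d_ff
W2 = d_ff * d_model
W3 = d_model * d_ff

FFN = W1 + W2 + W3

# Norm gains
layer_norm1 = layer_norm2 = d_model

# Total over all transformer blocks
total = (attention + FFN + layer_norm1 + layer_norm2) * num_layers
total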

1966233600

In [18]:
# Rest

#final norm
norm_final = d_model

# LLM head
llm_head = d_model*vocab_size

rest = llm_head+norm_final

rest

80412800

In [19]:
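# Answer: about 2.13B trainable parameters, which need about 7.92 GiB in fp32 (4 bytes each).
total_parameters = rest + total + token_embedding_parameters
total_parameters, total_parameters * 4 / 1024**3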

(2127057600, 7.923907041549683)
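The cell-by-cell tally above can be folded into one function, which makes it easier to re-run for other configurations. This is a minimal sketch under the same counting choices as the cells: bias-free linear layers, a three-matrix SwiGLU feed-forward network, one gain vector per norm, a separate (untied) LM head, and no learned position embeddings. The name `count_parameters` is a hypothetical helper, not part of the assignment.

```python
def count_parameters(vocab_size, d_model, num_layers, d_ff):
    """Trainable parameters under the counting choices described above."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model   # W_q, W_k, W_v, W_o
    ffn = 3 * d_model * d_ff            # W1, W2, W3 (SwiGLU)
    norms = 2 * d_model                 # two norm gain vectors per block
    per_block = attention + ffn + norms
    final_norm = d_model
    lm_head = d_model * vocab_size
    return embedding + num_layers * per_block + final_norm + lm_head


n_params = count_parameters(vocab_size=50257, d_model=1600, num_layers=48, d_ff=6400)
fp32_gib = n_params * 4 / 1024**3       # 4 bytes per single-precision parameter
print(f"{n_params:,} parameters, {fp32_gib:.2f} GiB in fp32")
```

`num_heads` does not appear because splitting `d_model` across heads changes how the projection matrices are sliced, not how many entries they hold.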


(b) Identify the matrix multiplies required to complete a forward pass of our GPT-2 XL-shaped
model. How many FLOPs do these matrix multiplies require in total? Assume that our input
sequence has context_length tokens.

Deliverable: A list of matrix multiplies (with descriptions), and the total number of FLOPs
required.



In [20]:
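# Matrix multiplies (S = context_length, d = d_model; an (m, n) @ (n, p) matmul costs 2*m*n*p FLOPs).
# Per block:
#   Q, K, V projections:        3 * 2*S*d*d     ~ 15.7B
#   attention scores Q @ K^T:   2*S*S*d         ~ 3.4B
#   attention weights @ V:      2*S*S*d         ~ 3.4B
#   attention output projection: 2*S*d*d        ~ 5.2B
#   SwiGLU FFN (W1, W2, W3):    3 * 2*S*d*d_ff  ~ 62.9B
# Per block ~ 90.6B; times 48 blocks ~ 4.35T.
# LM head: 2*S*d*vocab_size ~ 0.16T.
# Total ~ 4.51e12 FLOPs.
4.5e+12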

4500000000000.0
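The total above can be reproduced by listing each matrix multiply and applying the rule that an (m, n) by (n, p) product costs 2·m·n·p FLOPs. The sketch below assumes the same architecture as the parameter count in part (a): bias-free projections, a three-matrix SwiGLU feed-forward network with `d_ff = 6400`, and an untied LM head. `forward_matmul_flops` is a hypothetical helper name.

```python
def forward_matmul_flops(seq_len, d_model, num_layers, d_ff, vocab_size):
    """Return {matmul description: FLOPs} for one forward pass of one sequence."""
    s, d = seq_len, d_model
    per_block = {
        "Q, K, V projections": 3 * 2 * s * d * d,
        "attention scores Q @ K^T": 2 * s * s * d,   # summed over all heads
        "attention weights @ V": 2 * s * s * d,      # summed over all heads
        "attention output projection": 2 * s * d * d,
        "SwiGLU FFN (W1, W2, W3)": 3 * 2 * s * d * d_ff,
    }
    # Every block repeats the same matmuls; the LM head runs once.
    flops = {name: num_layers * f for name, f in per_block.items()}
    flops["LM head"] = 2 * s * d * vocab_size
    return flops


flops = forward_matmul_flops(seq_len=1024, d_model=1600, num_layers=48,
                             d_ff=6400, vocab_size=50257)
total_flops = sum(flops.values())
for name, f in flops.items():
    print(f"{name:30s} {f:.3e}  ({f / total_flops:.1%})")
print(f"{'total':30s} {total_flops:.3e}")
```

The head count drops out of the attention terms because each head works on `d_model / num_heads` dimensions, so summing over heads restores a factor of `d_model`.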

(c) Based on your analysis above, which parts of the model require the most FLOPs?

Deliverable: A one-to-two sentence response.



In [21]:
# The feed forward networks (SwiGLU) require the most FLOPs, consuming approximately 69.4% of computation per block due to the large d_ff dimension (6,400). 
# The attention mechanism follows at about 30.6% per block, while the final LM head contributes 3.6% of total FLOPs.

(d) Repeat your analysis with GPT-2 small (12 layers, 768 d_model, 12 heads), GPT-2 medium (24
layers, 1024 d_model, 16 heads), and GPT-2 large (36 layers, 1280 d_model, 20 heads). As the
model size increases, which parts of the Transformer LM take up proportionally more or less of
the total FLOPs?

Deliverable: For each model, provide a breakdown of model components and its associated
FLOPs (as a proportion of the total FLOPs required for a forward pass). In addition, provide a
one-to-two sentence description of how varying the model size changes the proportional FLOPs
of each component.



In [22]:
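# d_ff is set to 8/3 × d_model, rounded down to a multiple of 64.
# FF and attention proportions are shares of per-block FLOPs; the LM head proportion is a share of total FLOPs.

# GPT-2 Small (12 layers, 768 d_model, 12 heads):

# d_ff = 2,048
# Per block FLOPs: ~17.7B
# Total FLOPs: ~292B
# FF proportion: ~54.5%
# Attention proportion: ~45.5%
# LM head proportion: ~27.1%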

# GPT-2 Medium (24 layers, 1024 d_model, 16 heads):

# d_ff = 2,688
# Per block FLOPs: ~29.8B
# Total FLOPs: ~821B
# FF proportion: ~56.8%
# Attention proportion: ~43.2%
# LM head proportion: ~12.8%

# GPT-2 Large (36 layers, 1280 d_model, 20 heads):

# d_ff = 3,392
# Per block FLOPs: ~45.5B
# Total FLOPs: ~1.77T
# FF proportion: ~58.7%
# Attention proportion: ~41.3%
# LM head proportion: ~7.4%

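# GPT-2 XL (48 layers, 1600 d_model, 25 heads):

# d_ff = 4,224 (8/3 × d_model rounded down to a multiple of 64; parts (a)-(c) use the given d_ff = 6,400)
# Per block FLOPs: ~69.2B
# Total FLOPs: ~3.49T
# FF proportion: ~60.0%
# Attention proportion: ~40.0%
# LM head proportion: ~4.7%

# As the model grows, the feed-forward share of each block rises (54.5% -> 60.0%) and the attention share
# falls (45.5% -> 40.0%), while the LM head shrinks from ~27.1% to ~4.7% of total FLOPs.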
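The per-model figures above can be regenerated with a short loop. This sketch assumes, as those figures do, that `d_ff` is 8/3 × `d_model` rounded down to a multiple of 64, with a context length of 1,024 and the GPT-2 vocabulary size. Unlike the per-block percentages listed above, it reports every component as a share of the whole forward pass, so its attention and feed-forward shares come out lower. `MODELS` and `flops_breakdown` are hypothetical names.

```python
MODELS = {  # name: (num_layers, d_model)
    "GPT-2 small": (12, 768),
    "GPT-2 medium": (24, 1024),
    "GPT-2 large": (36, 1280),
    "GPT-2 XL": (48, 1600),
}


def flops_breakdown(num_layers, d_model, seq_len=1024, vocab_size=50257):
    # Assumption: d_ff = 8/3 * d_model, rounded down to a multiple of 64.
    d_ff = (8 * d_model // 3) // 64 * 64
    # Projections (8*S*d^2) plus score and value matmuls (4*S^2*d), all blocks.
    attention = num_layers * (8 * seq_len * d_model**2 + 4 * seq_len**2 * d_model)
    ffn = num_layers * 6 * seq_len * d_model * d_ff
    lm_head = 2 * seq_len * d_model * vocab_size
    total = attention + ffn + lm_head
    return {
        "d_ff": d_ff,
        "total": total,
        "attention": attention / total,
        "ffn": ffn / total,
        "lm_head": lm_head / total,
    }


results = {name: flops_breakdown(*cfg) for name, cfg in MODELS.items()}
for name, r in results.items():
    print(f"{name:13s} d_ff={r['d_ff']:5d} total={r['total']:.3e} "
          f"attention={r['attention']:.1%} ffn={r['ffn']:.1%} lm_head={r['lm_head']:.1%}")
```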

(e) Take GPT-2 XL and increase the context length to 16,384. How does the total FLOPs for one
forward pass change? How do the relative contribution of FLOPs of the model components
change?

Deliverable: A one-to-two sentence response.

In [None]:
# Scaling effects (S = context length, 1,024 -> 16,384, using d_ff = 4,224 as in part (d)):

# Linear layers (Q/K/V/output projections, FF): scale linearly with S (16× increase)
# Attention matmuls (Q @ K^T and weights @ V): scale quadratically with S (256× increase)
# LM head: scales linearly with S (16× increase)

# Total FLOPs: ~133 trillion (38.2× the ~3.49T at S = 1,024)
# New FLOP distribution:

# Feed forward: now only ~24.4% of per-block FLOPs (down from 60.0%)
# Attention: now dominates at ~75.6% of per-block FLOPs (up from 40.0%)
# LM head: ~2.0% of total FLOPs (down from 4.7%)

# With the longer context, attention becomes the dominant cost because its score and value matmuls grow as O(S²),
# shifting the computational profile from feed-forward-dominated to attention-dominated.
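Separating the terms that grow linearly in the sequence length from the one that grows quadratically makes the 16× versus 256× behaviour easy to check. The sketch below assumes the GPT-2 XL shape with `d_ff = 4224`, matching the figures quoted above rather than the `d_ff = 6400` given in part (a). `forward_flops_by_scaling` is a hypothetical helper name.

```python
def forward_flops_by_scaling(seq_len, d_model=1600, num_layers=48,
                             d_ff=4224, vocab_size=50257):
    """Split forward-pass matmul FLOPs into terms linear and quadratic in seq_len."""
    # Linear in S: Q/K/V/output projections, SwiGLU FFN, and the LM head.
    linear = (num_layers * (8 * seq_len * d_model**2 + 6 * seq_len * d_model * d_ff)
              + 2 * seq_len * d_model * vocab_size)
    # Quadratic in S: Q @ K^T and attention weights @ V.
    quadratic = num_layers * 4 * seq_len**2 * d_model
    return linear, quadratic


short_linear, short_quadratic = forward_flops_by_scaling(1024)
long_linear, long_quadratic = forward_flops_by_scaling(16384)
short_total = short_linear + short_quadratic
long_total = long_linear + long_quadratic

print(f"linear terms grow    {long_linear / short_linear:.0f}x")
print(f"quadratic terms grow {long_quadratic / short_quadratic:.0f}x")
print(f"total grows          {long_total / short_total:.1f}x "
      f"({short_total:.3e} -> {long_total:.3e} FLOPs)")
print(f"quadratic share of total: {short_quadratic / short_total:.1%} -> "
      f"{long_quadratic / long_total:.1%}")
```

Passing `d_ff=6400` instead gives the corresponding figures for the configuration in part (a).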