# Symbolic Transformer

This project is intended to support describing and reasoning about the internals of Transformer language model algorithms while keeping a close link between code and notation. Initial work is based on the Pythia models from EleutherAI and Neel Nanda's TransformerLens analysis tool.

# Terminology and Conventions

Analysis is focussed on individual residual streams, that is, the vectors at a given position after particular layers of a transformer. See the descriptions in https://transformer-circuits.pub/2021/framework/index.html and https://www.neelnanda.io/mechanistic-interpretability/glossary

The notation aims to be consistent with that described in https://transformer-circuits.pub/2021/framework/index.html#notation but there are deviations.

## Indexing Conventions

A **transformer block** is the combination of attention layer and MLP (with their associated layer normalizations). Superscripts refer to the block number:
- Block 0: the embedding layer (before any transformer blocks)
- Blocks 1 to $L$: transformer blocks ($L=6$ for pythia-70m-deduped)

**Position** refers to token ordering within the context window. Subscripts indicate position:
- Position 0: first token
- Position $n-1$: last token ($n=2048$ for pythia-70m-deduped)

## Core Notation

| Symbol | Meaning |
|--------|---------|
| $x^i_j$ | Residual vector after block $i$, at position $j$ |
| $\underline{\text{token}}$ | Token embedding vector (row of $W_E$) |
| $\overline{\text{token}}$ | Unembedding vector (row of $W_U$) |
| $\Delta x^i_j$ | Contribution from block $i$ to residual at position $j$ |
| $W_E$, $W_U$ | Embedding and unembedding matrices |

## Operators

| Symbol | Meaning |
|--------|---------|
| $LN$ | Layer normalization |
| $A^{i,h}$ | Attention pattern matrix for layer $i$, head $h$ |
| $\tilde{A}^i$ | Attention sub-layer including pre-norm: $\tilde{A}^i(x) = \text{Attn}^i(LN(x))$ |
| $M^i$ | MLP layer $i$ |
| $\tilde{M}^i$ | MLP sub-layer including pre-norm: $\tilde{M}^i(x) = M^i(LN(x))$ |
| $B_i$ | Full block $i$ (see below) |

The tilde notation $\tilde{A}$, $\tilde{M}$ indicates sub-layers with their preceding layer normalization absorbed.

# Analysis

When inference is run, the unembedding weights are applied to the layer-normalized residual vector from the final block to
generate a logit for every token in the vocabulary (i.e. an unbounded score which softmax normalizes across all tokens to
give each a probability between 0 and 1). The predicted token is then chosen from those with the greatest logits, for example
by taking the maximum or by sampling. During training, the cross-entropy loss used for gradient descent is calculated from
the probability assigned to the actual next token.

For output residual vector $x_j^L$ at position $j$, the logits are:
$$
W_U \cdot LN(x_j^L)
$$

The logit for a particular token (say 'ublin', with unembedding vector $\overline{\text{ublin}}$) is the dot product:
$$
\langle \overline{\text{ublin}}, LN(x_j^L) \rangle
$$
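
As a minimal NumPy sketch of the logit computation above: the sizes are toy values, the weights are random stand-ins for a trained model, and `layer_norm` omits the learned scale and bias. The names `W_U`, `x_final` and `token_id` are illustrative assumptions, not identifiers from this project.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer norm without learned scale and bias (a simplification)."""
    centered = x - x.mean(axis=-1, keepdims=True)
    var = (centered ** 2).mean(axis=-1, keepdims=True)
    return centered / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d_model, d_vocab = 16, 50                    # toy sizes, not Pythia's
W_U = rng.normal(size=(d_vocab, d_model))    # one unembedding vector per row
x_final = rng.normal(size=d_model)           # stands in for x_j^L

logits = W_U @ layer_norm(x_final)           # W_U . LN(x_j^L)

# Softmax turns the unbounded logits into probabilities that sum to 1.
shifted = logits - logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

# The logit of a single token is the dot product with its unembedding vector.
token_id = 7                                 # arbitrary illustrative token
single_logit = W_U[token_id] @ layer_norm(x_final)

# Cross-entropy loss if token_id were the actual next token.
loss = -np.log(probs[token_id])
```

Subtracting the maximum logit before exponentiating does not change the probabilities; it only avoids overflow.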

## Residual Stream Structure

The layers of the transformer add to the residual stream:
$$
x_j^L = x_j^0 + \sum_{i=1}^{L} \Delta x_j^i
$$

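where $x_j^0$ is the output of the embedding layer. For models with learned absolute positional embeddings (e.g. GPT-2), $x_j^0 = \underline{\text{token}_j} + p_j$ combines the token embedding with a positional embedding $p_j$. Pythia instead uses rotary position embeddings, which are applied to the queries and keys inside each attention layer, so $x_j^0 = \underline{\text{token}_j}$.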
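
The additive structure can be sketched with a few lines of NumPy. Random vectors stand in for the embedding output and the block contributions, and the array names (`x0`, `deltas`, `resid`) are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_blocks = 8, 6     # toy width; 6 blocks as in pythia-70m-deduped

x0 = rng.normal(size=d_model)                   # stands in for x_j^0
deltas = rng.normal(size=(n_blocks, d_model))   # stand in for Delta x_j^1 .. Delta x_j^L

# resid[i] is x_j^i: the embedding output plus the contributions of blocks 1..i.
resid = np.vstack([x0, x0 + np.cumsum(deltas, axis=0)])

x_final = resid[-1]                  # x_j^L
recovered = np.diff(resid, axis=0)   # Delta x_j^i = x_j^i - x_j^{i-1}
```

Going the other way, differencing residuals stored after each block recovers the per-block contributions.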

## Block Structure (Pre-Norm)

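Pythia, like other pre-norm architectures, applies layer normalization *before* each sub-layer. In the common sequential pre-norm form (e.g. GPT-2), the **block operator** is:

$$
B_i(x) = x + \tilde{A}^i(x) + \tilde{M}^i(x + \tilde{A}^i(x))
$$

where:
- $\tilde{A}^i(x) = \text{Attn}^i(LN(x))$ is attention with pre-norm
- $\tilde{M}^i(y) = M^i(LN(y))$ is the MLP with pre-norm

Pythia follows GPT-NeoX in using a parallel residual, in which the MLP reads the block input rather than the post-attention residual:

$$
B_i(x) = x + \tilde{A}^i(x) + \tilde{M}^i(x)
$$

In both forms the two layer normalizations in a block have separate learned parameters, and $\tilde{A}^i$ at position $j$ also reads the residuals at earlier positions through the attention pattern, so writing it as a function of a single $x$ is a shorthand.

The full forward pass is then:
$$
x_j^L = B_L \circ B_{L-1} \circ \cdots \circ B_1 (x_j^0)
$$

with a final layer norm before unembedding: $\text{logits}_j = W_U \cdot LN(x_j^L)$

## Block Contribution

The contribution from block $i$ can be decomposed into an attention term and an MLP term. In the sequential form:
$$
\Delta x_j^i = \tilde{A}^i(x_j^{i-1}) + \tilde{M}^i(x_j^{i-1} + \tilde{A}^i(x_j^{i-1}))
$$

and in the parallel form used by Pythia:
$$
\Delta x_j^i = \tilde{A}^i(x_j^{i-1}) + \tilde{M}^i(x_j^{i-1})
$$

Note: In the sequential form the MLP sees the residual *after* attention has been added, so the two terms are not independent. In the parallel form both terms read the same input $x_j^{i-1}$.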
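
A minimal sketch of a pre-norm block and its contribution $\Delta x$, under loudly stated assumptions: the sub-layers are hypothetical toy functions (a linear map standing in for attention at a single position, and a small tanh MLP), and layer norm has no learned parameters. The `parallel` flag switches between the MLP reading the post-attention residual and the MLP reading the block input, the parallel-residual variant used by GPT-NeoX-style models.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer norm without learned scale and bias (a simplification)."""
    centered = x - x.mean(axis=-1, keepdims=True)
    return centered / np.sqrt((centered ** 2).mean(axis=-1, keepdims=True) + eps)

def block_contribution(x, attn, mlp, parallel):
    """Delta x for one pre-norm block; attn and mlp exclude their layer norm."""
    a = attn(layer_norm(x))                 # A~(x) = Attn(LN(x))
    mlp_input = x if parallel else x + a    # parallel: MLP reads the block input
    return a + mlp(layer_norm(mlp_input))   # A~(x) + M~(mlp_input)

rng = np.random.default_rng(0)
d_model, n_blocks = 8, 3
x0 = rng.normal(size=d_model)

def make_toy_block():
    # Hypothetical stand-ins: real attention also mixes positions,
    # and Pythia's MLP uses GeLU rather than tanh.
    W_a = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    W_in = rng.normal(size=(d_model, 4 * d_model)) / np.sqrt(d_model)
    W_out = rng.normal(size=(4 * d_model, d_model)) / np.sqrt(4 * d_model)
    return (lambda h: h @ W_a), (lambda h: np.tanh(h @ W_in) @ W_out)

toy_blocks = [make_toy_block() for _ in range(n_blocks)]

def forward(x, parallel):
    """Apply all blocks in order, returning x^L and the stacked contributions."""
    deltas = []
    for attn, mlp in toy_blocks:
        delta = block_contribution(x, attn, mlp, parallel)
        x = x + delta                       # B_i(x) = x + Delta x
        deltas.append(delta)
    return x, np.stack(deltas)

x_seq, deltas_seq = forward(x0, parallel=False)
x_par, deltas_par = forward(x0, parallel=True)
```

In either variant the final residual equals `x0` plus the sum of the returned contributions, which is what makes per-block attribution possible.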

### Non-Linear Operator Properties

- **Attention**: A data-dependent linear map. For a fixed pattern $A^{i,h}$ the output is linear in the value vectors, but the pattern itself depends on the input, making the overall operation non-linear in the input
- **MLP**: Position-wise non-linear transformation (GeLU activation in Pythia)
- **Layer Norm**: Non-linear scaling that projects onto a sphere (after centering), so $LN(a + b) \neq LN(a) + LN(b)$ in general and the contributions to a residual cannot be normalized independently of each other

Because each $B_i$ has the structure $x + f_i(x)$ (adding a perturbation rather than fully transforming), contributions from different layers can be meaningfully compared in the shared residual space.
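
The layer norm property in the list above can be checked numerically. This is a small sketch assuming a layer norm without learned scale and bias; the vectors `a` and `b` are random stand-ins for two contributions to a residual.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer norm without learned scale and bias (a simplification)."""
    centered = x - x.mean()
    return centered / np.sqrt((centered ** 2).mean() + eps)

rng = np.random.default_rng(0)
d_model = 16
a, b = rng.normal(size=(2, d_model))

# Not additive: normalizing a sum differs from summing the normalized parts.
additive_gap = np.linalg.norm(layer_norm(a + b) - (layer_norm(a) + layer_norm(b)))

# Scale-invariant (up to eps): rescaling the input barely changes the output.
scale_gap = np.linalg.norm(layer_norm(3.0 * a) - layer_norm(a))

# Every output lies on (approximately) a sphere of radius sqrt(d_model).
out_norm = np.linalg.norm(layer_norm(a))
```

Here `additive_gap` is clearly non-zero while `scale_gap` is close to zero, which is the sense in which layer norm projects onto a sphere.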

# Layer Normalization and Dot Product

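$$\langle v_1, LN(v_2) \rangle \approx \sqrt{d_{\text{model}}} \, |v_1| \cos{\theta_{v_1,c(v_2)}} $$
where $\theta_{a,b}$ is the angle between vectors $a$ and $b$, and $c(v)$ is the centering operation (subtracting the mean of the components of $v$ from each component) described in reexamine_layer_norm.ipynb. The approximation ignores the learned scale and bias of $LN$ and its small $\epsilon$ term. The factor $\sqrt{d_{\text{model}}}$ is the norm of any layer-normalized vector and is the same for every token, so under this approximation comparisons between logits depend only on $|v_1|$ and $\theta$.

When $v_2$ is a sum of contributions, such as $x_j^L$, the terms can be added one at a time. Each partial sum can be considered in 2d, in the plane it spans with $v_1$, where adding the next term either increases or decreases $\theta$. By building up the sum of vectors on the right-hand side of the dot product we can see which vectors contribute to the angle used in the final dot product.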
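
A numerical check of the relationship between layer norm and angle, as a sketch that assumes no learned scale or bias. Under that assumption the proportionality constant works out to $\sqrt{d_{\text{model}}}$, the norm of a layer-normalized vector. The helper names (`center`, `angle`) and the random `terms` are illustrative only.

```python
import numpy as np

def center(v):
    """Subtract the mean of the components from each component."""
    return v - v.mean()

def layer_norm(v, eps=1e-5):
    """Layer norm without learned scale and bias (a simplification)."""
    c = center(v)
    return c / np.sqrt((c ** 2).mean() + eps)

def angle(a, b):
    """Angle in radians between vectors a and b."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(0)
d_model = 16
v1 = rng.normal(size=d_model)            # e.g. an unembedding vector
terms = rng.normal(size=(4, d_model))    # contributions that sum to v2
v2 = terms.sum(axis=0)

lhs = v1 @ layer_norm(v2)
rhs = np.sqrt(d_model) * np.linalg.norm(v1) * np.cos(angle(v1, center(v2)))

# Angle to v1 after each successive term is added to the running sum.
partial_sums = np.cumsum(terms, axis=0)
thetas = [angle(v1, center(s)) for s in partial_sums]
```

Comparing consecutive entries of `thetas` shows whether each added term moved the running sum towards or away from `v1`.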

# Residual Space Structure

## Dual Pairing

The unembedding vectors act as **linear functionals** on the residual space. For any residual $x$ (taken after the final layer norm, or with that normalization ignored):
$$
\text{logit}_{\text{token}}(x) = \langle \overline{\text{token}}, x \rangle
$$

This is a dual pairing: $\overline{\text{token}} \in V^*$ (dual space) acts on $x \in V$ (residual space). The embedding vectors $\underline{\text{token}} \in V$ live in the primal space.

While embedding/unembedding vectors don't form a basis (vocabulary size $\gg d_{\text{model}}$ typically), they define **privileged directions** for interpretation. The success of logit lens and tuned lens relies on $\langle \overline{\text{token}}, x^i_j \rangle$ being meaningful at intermediate layers.
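
A minimal sketch of the logit lens idea: apply the unembedding to the residual after every block, not only the last. It assumes a layer norm without learned parameters applied before unembedding, and uses random arrays in place of real residuals and weights, so the resulting token ids carry no meaning.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer norm without learned scale and bias (a simplification)."""
    centered = x - x.mean(axis=-1, keepdims=True)
    return centered / np.sqrt((centered ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_model, d_vocab, n_blocks = 16, 50, 6       # toy sizes
W_U = rng.normal(size=(d_vocab, d_model))    # one unembedding vector per row

# resid[i] stands in for x_j^i at a single position j, for i = 0 .. L.
resid = np.cumsum(rng.normal(size=(n_blocks + 1, d_model)), axis=0)

# Pair every unembedding vector with every intermediate residual.
lens_logits = layer_norm(resid) @ W_U.T      # shape (n_blocks + 1, d_vocab)
top_tokens = lens_logits.argmax(axis=-1)     # highest-logit token after each block
```

With a real model, the row for the final block reproduces the model's actual logits, and earlier rows show how the prediction forms across blocks.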

## Feature Types in the Residual Stream

Residual vectors can be decomposed by their functional role. At position $j$, layer $i$, we can ask what directions in $x^i_j$ contribute to:

| Feature Type | Detection Method | Mathematical Signature |
|--------------|------------------|------------------------|
| **Next-token predictors** | Direct logit attribution | $\langle \overline{\text{token}}, x^i_j \rangle$ large |
| **Attention attractors** | Key-query analysis | $\langle W_K x^i_j, W_Q x^i_k \rangle$ large for $k > j$ |
| **Position influencers** | OV-circuit attribution | $W_{OV} x^i_j$ contributes to $\Delta x^{i'}_k$ for $k > j$, $i' > i$ |

These are not orthogonal—a single direction may serve multiple roles. But this decomposition helps trace *how* information flows:
1. Token $\rightarrow$ embedding $\rightarrow$ residual
2. Residual $\rightarrow$ attention attractor $\rightarrow$ copied to later position
3. Residual $\rightarrow$ predictor $\rightarrow$ logit
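
The key-query and OV signatures in the table can be sketched for a single causal attention head. This is an illustration under simplifying assumptions: random weights, no layer norm, no rotary position embedding and no biases. The array names are hypothetical and the matrices act on column vectors as in the table.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, d_model, d_head = 5, 16, 4
x = rng.normal(size=(n_pos, d_model))    # stand-ins for residuals at each position
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

q, k, v = x @ W_Q.T, x @ W_K.T, x @ W_V.T

# scores[dst, src] = <W_K x_src, W_Q x_dst> / sqrt(d_head)
scores = q @ k.T / np.sqrt(d_head)
causal = np.tril(np.ones((n_pos, n_pos), dtype=bool))   # dst may attend to src <= dst
scores = np.where(causal, scores, -np.inf)

# Row-wise softmax gives the attention pattern.
scores = scores - scores.max(axis=-1, keepdims=True)
pattern = np.exp(scores)
pattern /= pattern.sum(axis=-1, keepdims=True)

# OV circuit: per_source[dst, src] is the vector that src writes into dst.
W_OV = W_O @ W_V
per_source = pattern[:, :, None] * (x @ W_OV.T)[None, :, :]
head_out = per_source.sum(axis=1)        # head contribution at each position
```

A large `scores[dst, src]` marks `src` as an attention attractor for `dst`, and `per_source[dst, src]` is the part of the head's output at `dst` that can be attributed to `src`.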

## Notation for Attribution

For tracing contributions, we can write:
- $x^i_j \xrightarrow{A^{i',h}} x^{i'}_k$: attention head $h$ at layer $i'$ copies from position $j$ to $k$
- $x^i_j \xrightarrow{W_U} \overline{\text{token}}$: residual contributes to token logit

This directed notation complements the algebraic expressions by showing causal flow.
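
One way to put numbers on the $\xrightarrow{W_U}$ arrow is direct logit attribution with the layer norm scale held fixed. The sketch below assumes a layer norm without learned parameters and uses random vectors for the residual terms and the unembedding vector `u`. Freezing the scale is a common simplification, not necessarily the method used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_blocks = 16, 3
terms = rng.normal(size=(n_blocks + 1, d_model))   # x^0, Delta x^1 .. Delta x^L
x_final = terms.sum(axis=0)                        # x_j^L
u = rng.normal(size=d_model)                       # unembedding vector of one token

# Layer norm denominator, computed once from the full residual and then frozen.
centered_final = x_final - x_final.mean()
scale = np.sqrt((centered_final ** 2).mean() + 1e-5)

# Centering is linear, so each term can be centered separately.
centered_terms = terms - terms.mean(axis=1, keepdims=True)
attributions = centered_terms @ u / scale          # one logit share per term

full_logit = u @ (centered_final / scale)          # <u, LN(x_j^L)>
```

The entries of `attributions` sum to `full_logit`, so each block's share of the token's logit can be read off directly.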