In [32]:
using Symbolics
using Latexify
using Line

include("../data/probe_token.jl")
include("../data/pre_norm.jl")

N=512

μ(x) = sum(x) / N
E(x) = μ(x) 

c(x) = x .- μ(x)

#var(x) = sum(c(x) .^2 )
#var(x) = sum((x .- μ(x)) .^2 )
var(x) = sum((x .- μ(x)) .^2 )/N

ϵ = 1e-5



1.0e-5

[ReExaminingLayerNorm.ipynb](https://colab.research.google.com/drive/1S39-w4vzX3VzZx_27X_BtrLs442pOJnJ) (also [described on LessWrong](https://www.lesswrong.com/posts/jfG6vdJZCwTQmG7kb/re-examining-layernorm)) gives the following as PyTorch's definition of layer norm:

In [33]:
LN(x) = (x .- E(x))/sqrt(var(x) + ϵ) 


LN (generic function with 1 method)
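As a quick sanity check that does not need the `probe_token` or `pre_norm` data files, the sketch below applies the same definition to a synthetic vector. The names `mean_sketch`, `var_sketch`, `ln_sketch`, `x_demo` and `y_demo` are hypothetical and chosen so they do not clash with the notebook's own definitions; the input is random, not model data.

```julia
# Sketch only: layer norm as defined above, applied to synthetic data.
using Random

mean_sketch(x) = sum(x) / length(x)
var_sketch(x) = sum((x .- mean_sketch(x)) .^ 2) / length(x)   # biased (1/N) variance
ln_sketch(x; ϵ=1e-5) = (x .- mean_sketch(x)) ./ sqrt(var_sketch(x) + ϵ)

Random.seed!(0)
x_demo = 3.0 .* randn(512) .+ 2.0     # synthetic vector with a non-zero mean
y_demo = ln_sketch(x_demo)

println("mean after LN: ", mean_sketch(y_demo))   # close to 0
println("var after LN:  ", var_sketch(y_demo))    # slightly below 1 because of ϵ
```

The output variance is `var / (var + ϵ)`, so it falls just short of 1 rather than hitting it exactly.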

If it is equivalent to PyTorch's layer norm, the logit computed below should come out as 11.4077.

In [34]:
bias = 0.8328

final_residual = LN(pre_norm)

logit = sum(.*(probe_token, final_residual)) + bias


11.407851912178797

Following the notebook, define the Euclidean norm and the smoothed unit-vector map $u_\epsilon$:

In [37]:
norm(x) = sqrt(sum(x .^ 2))
u_ϵ(x) = x .* (1/sqrt(norm(x)^2 + ϵ) )

u_ϵ (generic function with 1 method)

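For a centered vector $x$ (mean zero, so that $\textrm{Var}[x] = |x|^2 / n$):

$$\sqrt{n} \cdot u_{n \epsilon}(x) = \frac{x}{\sqrt{\textrm{Var}[x] + \epsilon}}$$

The next cell uses $u_\epsilon$ rather than $u_{n \epsilon}$ and skips the centering, so its logit agrees with the one above only to about five decimal places.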

In [38]:
final_residual = sqrt(512) .* u_ϵ(pre_norm)


logit = sum(.*(probe_token, final_residual)) + bias

11.40785559974425

$$LN(x) = \sqrt{n} \cdot u_{n \epsilon}(c(x))$$

In [40]:
u_nϵ(x) = x .* (1/sqrt(norm(x)^2 + (512*ϵ)) )


final_residual = sqrt(512) .* u_nϵ(c(pre_norm))


logit = sum(.*(probe_token, final_residual)) + bias

11.407851912178797
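The match with the first logit is expected, because the two forms are algebraically identical. A minimal sketch of that check on synthetic data follows; `center_sketch`, `u_sketch`, `ln_direct`, `ln_via_u` and `x_id` are hypothetical names, not part of the notebook.

```julia
# Sketch only: compare the direct layer-norm formula with sqrt(N) * u_{Nϵ}(c(x)).
using Random

center_sketch(x) = x .- sum(x) / length(x)
u_sketch(x, ϵ) = x ./ sqrt(sum(x .^ 2) + ϵ)

# Direct form: centered vector divided by sqrt(variance + ϵ).
ln_direct(x; ϵ=1e-5) = center_sketch(x) ./ sqrt(sum(center_sketch(x) .^ 2) / length(x) + ϵ)
# Unit-vector form: note that ϵ is scaled by the vector length.
ln_via_u(x; ϵ=1e-5) = sqrt(length(x)) .* u_sketch(center_sketch(x), length(x) * ϵ)

Random.seed!(1)
x_id = randn(512) .+ 0.5
max_gap = maximum(abs.(ln_direct(x_id) .- ln_via_u(x_id)))
println("largest component-wise difference: ", max_gap)
```

Any difference should be at the level of floating-point rounding.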

The standard definition of the mean, $\mu(v)$, is:

$$\mu(v) = \frac{\operatorname{sum}(v)}{N}$$
where $\operatorname{sum}(v)$ is the sum of the components of $v$ and $N$ is the number of components of $v$, so $\mu(v)$ is a scalar.

This can be stated using the dot product with the vector $\vec{1} = (1, 1, \ldots, 1)$: $$\mu(v) = \frac{<\vec{1},v>}{N}$$

The centering operation subtracts $\mu(v)$ from each component of $v$:
$$c(v) = v - \mu(v) \vec{1}$$

Going back to the earlier definition

$$LN(v) = \sqrt{N} \, u_{N \epsilon}(c(v))$$

In terms of Geometric Algebra $<u,v>  = \frac{1}{2} (uv + vu)$ so we can rewrite

$$c(v) = v - \frac{<\vec{1} , v>}{N} \vec{1}
= v - (\frac{\vec{1} v + v \vec{1}}{2N}) \vec{1}$$

Can we go as far as this? Multiplying through by $\vec{1}$ and using $\vec{1}\vec{1} = |\vec{1}|^2 = N$:
$$ = v - \frac{\vec{1} v \vec{1} + v |\vec{1}|^2}{2N}$$
$$ = v - \frac{\vec{1}v\vec{1}}{2N} - \frac{N}{2N}v$$
$$ = \frac{1}{2}v - \frac{1}{2N} \vec{1}v\vec{1}$$

This is checked for N=4 in [center.ipynb](../geometry/center.ipynb)
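That notebook is not reproduced here. As a lighter check that needs no geometric algebra package, the sketch below relies on the standard identity $aba = 2<a,b>a - |a|^2 b$ for the sandwich product of two vectors. The identity is brought in from outside this notebook, and `sandwich`, `ones_vec`, `v_ga` and the other names are hypothetical.

```julia
# Sketch only: check c(v) = v/2 - (1 v 1)/(2N) for N = 4 using plain vectors.
using LinearAlgebra: dot

# Sandwich product a*b*a of two vectors, via aba = 2<a,b>a - |a|^2 b.
sandwich(a, b) = 2 * dot(a, b) .* a .- dot(a, a) .* b

ones_vec = ones(4)
v_ga = [1.0, -2.0, 3.5, 0.25]
n_ga = length(v_ga)

c_direct = v_ga .- sum(v_ga) / n_ga                                   # v - μ(v) 1
c_sandwich = 0.5 .* v_ga .- sandwich(ones_vec, v_ga) ./ (2 * n_ga)    # v/2 - (1 v 1)/(2N)
println(c_direct)
println(c_sandwich)
```

Both lines should print the same centered vector.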


Expand the squared norm using this form:

$$<c(v), c(v)> = \frac{1}{4}v^2 - \frac{1}{4N}v\vec{1}v\vec{1} - \frac{1}{4N}\vec{1}v\vec{1}v + \frac{1}{4N^2} \vec{1} v \vec{1} \vec{1} v \vec{1}$$



But $$\vec{1} v \vec{1} \vec{1} v \vec{1} = N \vec{1} v v \vec{1} = N |v|^2 \vec{1}\vec{1} = N^2 |v|^2$$ 
So
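$$ <c(v), c(v)> = \frac{1}{4} |v|^2 - \frac{1}{4N} (v \vec{1})^2 - \frac{1}{4N} (\vec{1} v)^2 + \frac{1}{4} |v|^2 $$
$$ = \frac{1}{2} |v|^2 - \frac{1}{4N} \left( (v \vec{1})^2 + (\vec{1} v)^2 \right) $$

Here $(v\vec{1})^2 = v\vec{1}v\vec{1}$ is the square of a geometric product, not a squared norm (the squared norm of $v\vec{1}$ would be $N|v|^2$). Writing $v\vec{1} = <v,\vec{1}> + v \wedge \vec{1}$ and $\vec{1}v = <v,\vec{1}> - v \wedge \vec{1}$, with $(v \wedge \vec{1})^2 = <v,\vec{1}>^2 - N|v|^2$, gives $(v\vec{1})^2 + (\vec{1}v)^2 = 4<v,\vec{1}>^2 - 2N|v|^2$, so

$$ <c(v), c(v)> = |v|^2 - \frac{1}{N} <\vec{1},v>^2 $$

$$u_{\epsilon}(v) = \frac{v}{\sqrt{<v,v> + \epsilon} }$$

$$u_\epsilon (c(v)) = \frac{\frac{1}{2}v - \frac{1}{2N} \vec{1}v\vec{1}}{\sqrt{|v|^2 - \frac{1}{N} <\vec{1},v>^2 + \epsilon}}$$

Low confidence in the intermediate steps, which I haven't checked numerically. The result does simplify to something obvious: $|v|^2 - \frac{1}{N}<\vec{1},v>^2 = |v|^2 - N\mu(v)^2$ is the squared norm of the centered vector.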
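Independently of the geometric-algebra expansion, the squared norm of the centered vector can be computed directly and compared with $|v|^2 - <\vec{1},v>^2 / N$, which is the usual "mean of squares minus square of mean" identity scaled by $N$. A minimal sketch on synthetic data, with hypothetical names `v_sq`, `lhs_sq` and `rhs_sq`:

```julia
# Sketch only: numerical check that <c(v), c(v)> = |v|^2 - <1,v>^2 / N.
using Random

Random.seed!(2)
v_sq = randn(512) .+ 1.0
n_sq = length(v_sq)

lhs_sq = sum((v_sq .- sum(v_sq) / n_sq) .^ 2)    # <c(v), c(v)> computed directly
rhs_sq = sum(v_sq .^ 2) - sum(v_sq)^2 / n_sq     # |v|^2 - <1,v>^2 / N
println(lhs_sq, "  ", rhs_sq)
```

Any candidate formula for the denominator of $u_\epsilon(c(v))$ can be tested against `lhs_sq` in the same way.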

Next steps - use $u_\epsilon$ formula above in LN through projection matrices for attention heads.

## Other Notes

$$\mu(v) = \frac{ <v , \vec{1}> }{N}$$

Since $|\vec{1}| = \sqrt{N}$,
$$\mu(v) = \frac{ |v| \sqrt{N} \cos{\theta_{v,\vec{1}}}}{N}
    = \frac{|v| \cos{\theta_{v,\vec{1}}}}{\sqrt{N}} $$

where $\theta_{v,\vec{1}}$ is the angle between $v$ and $\vec{1}$
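The relation between the mean and the angle to $\vec{1}$ can be illustrated with a small example. This is a sketch with hypothetical names (`v_ang`, `one_ang`, `cos_theta`), not part of the original notebook.

```julia
# Sketch only: the mean of v expressed through the angle between v and the ones vector.
using LinearAlgebra: dot

v_ang = [2.0, -1.0, 0.5, 4.0, 1.5]
n_ang = length(v_ang)
one_ang = ones(n_ang)

len_v = sqrt(dot(v_ang, v_ang))                          # |v|
cos_theta = dot(v_ang, one_ang) / (len_v * sqrt(n_ang))  # cosine of the angle to the ones vector

mean_direct = sum(v_ang) / n_ang
mean_from_angle = len_v * cos_theta / sqrt(n_ang)
println(mean_direct, "  ", mean_from_angle)
```

A vector orthogonal to $\vec{1}$ has $\cos\theta = 0$ and therefore zero mean, which is exactly what centering produces.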

Applying layer normalization results in a vector of length approximately $\sqrt{N}$, in the direction of the centered vector, since $u_{N \epsilon}$ returns an approximately unit-length vector.

$$LN(v) = \sqrt{N} u_{N \epsilon}(c(v))$$

The inner product between two vectors $a$ and $b$, with LN applied to the second, is
$$< a, LN(b)> \approx \sqrt{N} \, |a| \cos{\theta_{a,c(b)}} $$

If $b$ is understood as the sum of several vectors, they can be analysed in terms of
how they contribute to the angle of the centered vector.
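Because centering is linear, $c(b_1 + b_2) = c(b_1) + c(b_2)$, so the inner product with $LN(b)$ splits into one term per component vector, all divided by the same normalization factor. The sketch below illustrates this on random stand-in vectors. The names `a_probe`, `parts` and `b_total` are hypothetical and the data is synthetic, not taken from the model.

```julia
# Sketch only: split <a, LN(b)> into per-component contributions when b is a sum.
using LinearAlgebra: dot
using Random

center_parts(x) = x .- sum(x) / length(x)

Random.seed!(3)
n_parts = 64
a_probe = randn(n_parts)                    # stand-in for the vector a
parts = [randn(n_parts) for _ in 1:3]       # stand-ins for the vectors that sum to b
b_total = sum(parts)

ϵ_parts = 1e-5
# Shared layer-norm denominator, computed from the full centered vector.
scale = sqrt(sum(center_parts(b_total) .^ 2) / n_parts + ϵ_parts)

total_logit = dot(a_probe, center_parts(b_total)) / scale
contributions = [dot(a_probe, center_parts(p)) / scale for p in parts]
println(total_logit, "  ", sum(contributions))
```

The contributions add up to the total, but the shared denominator depends on all components at once, so each component's effect is not independent of the others.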