This notebook compares the standard definition of layer normalization (Layer Norm) with a formulation based on geometric algebra (GA).

Standard definitions:

$$c(x)=x-\mu(x).$$
$$u_\epsilon(x)=\frac{x}{\sqrt{||x||^2+\epsilon}}.$$
$$\mathrm{E}[x]=\mu(x)=\frac{1}{n}\sum_{i=1}^nx_i.$$
$$\mathrm{Var}[x]=\sigma^2(x)=\frac{1}{n}\sum_{i=1}^n (x_i-\mu(x))^2.$$
PyTorch Layer Norm:
$$LN = \frac{x-\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}*\gamma+\beta.$$
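The definitions above translate directly into a few lines of plain Julia. The following is a minimal sketch, not the `LN` function from SymbolicTransformer used in the cells below: the name `layernorm_ref` and its keyword defaults (ϵ = 1e-5 as in PyTorch's default, γ = 1, β = 0) are assumptions made for illustration.

```julia
using Statistics

# Hypothetical reference implementation of the PyTorch-style formula.
# γ and β default to the identity affine map (assumption).
function layernorm_ref(x::AbstractVector{<:Real}; ϵ=1e-5, γ=1.0, β=0.0)
    μ = mean(x)                    # E[x]
    σ² = mean(abs2, x .- μ)        # biased variance Var[x] (divide by n)
    return (x .- μ) ./ sqrt(σ² + ϵ) .* γ .+ β
end

layernorm_ref([26, -10, -5, 41])
```

Because the input is centred before scaling, the entries of the result sum to zero up to rounding.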


In [6]:
using Random

N=4

# Initialise a random N-element integer vector
Random.seed!(42)
u = Vector{Int}(undef,N)
rand!(u,-100:100)
u

4-element Vector{Int64}:
  26
 -10
  -5
  41

In [7]:
using SymbolicTransformer

LN(u)

4-element Vector{Float64}:
  0.6118070401602436
 -1.0824278402835077
 -0.8471174402218756
  1.3177382403451399

From [reexamine_layer_norm.ipynb](../notebooks/reexamine_layer_norm.ipynb) we have:

$$c(v) = \frac{1}{2}v - \frac{1}{2N} \vec{1}v\vec{1}$$

$$u_\epsilon(\vec{x})=\frac{\vec{x}}{\sqrt{||\vec{x}||^2+\epsilon}}$$

$$LN = \sqrt{N}\, u_{N \epsilon}(c(x))$$
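
A short derivation shows why these expressions agree with the standard definitions. It is a sketch that relies on the geometric-algebra identity $a v a = 2(a\cdot v)a - a^2 v$ for vectors $a$ and $v$, and it takes $\gamma = 1$, $\beta = 0$, with $N$ the vector length (the $n$ of the standard definitions). Since $\vec{1}^2 = N$ and $\vec{1}\cdot v = N\mu(v)$:

$$\vec{1}v\vec{1} = 2(\vec{1}\cdot v)\vec{1} - \vec{1}^2 v = 2N\mu(v)\vec{1} - Nv$$

$$c(v) = \frac{1}{2}v - \frac{1}{2N}\left(2N\mu(v)\vec{1} - Nv\right) = v - \mu(v)\vec{1}$$

For the normalisation, $\mathrm{Var}[x] = ||c(x)||^2/N$, so

$$\frac{c(x)}{\sqrt{\mathrm{Var}[x]+\epsilon}} = \frac{c(x)}{\sqrt{||c(x)||^2/N+\epsilon}} = \sqrt{N}\,\frac{c(x)}{\sqrt{||c(x)||^2+N\epsilon}} = \sqrt{N}\,u_{N\epsilon}(c(x))$$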

In [8]:
using Grassmann

@basis S"++++"

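# Vector of all ones, the \vec{1} in the formulas above
ones = v₁ + v₂ + v₃ + v₄

# Centering as a sandwich product; 1/(2N) = 1/8 for N = 4
c(x) = (1/2) * x - (1/(2*N)) * (ones * x * ones)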

c (generic function with 1 method)

In [9]:

# ϵ already includes the factor N: a per-element epsilon of 1e-6 times N
ϵ = 1e-6*N

u_Nϵ(x) = x * (1 / sqrt(x ⋅ x + ϵ))
ga_LN(x) = sqrt(N) * u_Nϵ(c(x))

ga_LN (generic function with 1 method)

In [10]:
ga_u = u ⋅ [v₁, v₂, v₃, v₄]

0 + 26v₁ - 10v₂ - 5v₃ + 41v₄

In [11]:
LN(u)

4-element Vector{Float64}:
  0.6118070401602436
 -1.0824278402835077
 -0.8471174402218756
  1.3177382403451399

In [12]:
ga_LN(ga_u)

0.0 + 0.6118070462579881v₁ - 1.082427851071825v₂ - 0.8471174486649066v₃ + 1.3177382534787434v₄
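
The two results agree to about eight significant figures. The remaining difference is consistent with the two computations using different epsilon values. The sketch below reproduces both with plain vectors and no Grassmann dependency. It assumes that `LN` uses a per-element ϵ of 1e-5 (PyTorch's default), which is inferred from the numbers rather than taken from the SymbolicTransformer source. The names `sandwich`, `center`, `unit_eps`, and `ga_style_ln` are hypothetical.

```julia
using LinearAlgebra

# For vectors a and v, the sandwich product a v a equals 2(a⋅v)a - (a⋅a)v.
sandwich(a, v) = 2 * dot(a, v) .* a .- dot(a, a) .* v

# c(v) = v/2 - (1/(2N)) 1 v 1, which reduces to v .- mean(v)
function center(v)
    N = length(v)
    return 0.5 .* v .- (1 / (2 * N)) .* sandwich(ones(N), v)
end

# u_ϵ(v) = v / sqrt(|v|² + ϵ)
unit_eps(v, ϵ) = v ./ sqrt(dot(v, v) + ϵ)

# ϵ is the per-element epsilon; the normaliser receives N * ϵ.
ga_style_ln(v; ϵ) = sqrt(length(v)) .* unit_eps(center(v), length(v) * ϵ)

x = [26.0, -10.0, -5.0, 41.0]
ga_style_ln(x; ϵ=1e-6)   # same per-element epsilon as ϵ = 1e-6 * N above
ga_style_ln(x; ϵ=1e-5)   # PyTorch's default epsilon (assumed for LN)
```

Under these assumptions the first call should match the `ga_LN(ga_u)` coefficients and the second the `LN(u)` values.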

$$u_\epsilon (c(v)) = \frac{\frac{1}{2}v - \frac{1}{2N} \vec{1}v\vec{1}}{\sqrt{\frac{1}{2} |v|^2 - \frac{1}{4N} |v \vec{1}|^2 - \frac{1}{4N} |\vec{1} v |^2  + \epsilon}}$$

$$LN(v) = \sqrt{N} \frac{\frac{1}{2}v - \frac{1}{2N} \vec{1} v \vec{1}}
{\sqrt{\frac{1}{2} |v|^2 - \frac{1}{4N} |v \vec{1}|^2 - \frac{1}{4N} |\vec{1} v |^2  + N \epsilon}}$$
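
As a cross-check on the denominator, the squared norm of the centred vector can also be computed directly. This is a sketch that uses only the standard definition $c(x) = x - \mu(x)$ and is not taken from the referenced notebook. With $v\cdot\vec{1} = N\mu(v)$:

$$|c(v)|^2 = |v - \mu(v)\vec{1}|^2 = |v|^2 - 2\mu(v)(v\cdot\vec{1}) + N\mu(v)^2 = |v|^2 - \frac{(v\cdot\vec{1})^2}{N}$$

For the example vector $v = (26, -10, -5, 41)$ this gives $|v|^2 = 2482$ and $v\cdot\vec{1} = 52$, so

$$|c(v)|^2 = 2482 - \frac{52^2}{4} = 1806, \qquad \mathrm{Var}[v] = \frac{1806}{4} = 451.5$$

Any expansion of the denominator should reduce to this value plus the epsilon term.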


In [18]:
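# Scale and direction factors of the expanded formula above.
# Note: `v` in these bodies is the scalar basis element defined by `@basis`, not the argument `x`.
scale(x) = sqrt(N) / sqrt( (1/2) * (x ⋅ x) - (1/(4*N)) * ((v * ones) ⋅ (v * ones)) - (1/(4*N)) * ((ones * v) ⋅ ( ones * v)) + N * ϵ )
direction(x) = (1/2) * x - (1/(2*N)) * (ones * v * ones)
ga_LN2(x) = scale(x) * direction(x)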

ga_LN2 (generic function with 1 method)

In [19]:
ga_LN2(ga_u)

-0.02839236783843104 + 0.7382015637992071v₁ - 0.28392367838431043v₂ - 0.14196183919215521v₃ + 1.1640870813756727v₄

In [15]:
scale(ga_u)

0.05678473567686208v

In [16]:
# From LayerNormalization.jl: reference scale 1/sqrt(Var[u] + ϵ)
μ(v) = sum(v)/size(v,1)
(μ((u .- μ(u)).^2) + ϵ).^(-0.5)

0.04706208032503127
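
The reference value can be compared with the scale factor implied by $\sqrt{N}\,u_{N\epsilon}$, using the direct form $|c(v)|^2 = |v|^2 - (v\cdot\vec{1})^2/N$ for the centred norm. The following is a minimal sketch with plain vectors. The names `ref_scale`, `c2`, and `ga_scale` are hypothetical, and ϵ is set to the value `1e-6 * N` defined earlier in the notebook.

```julia
x = [26.0, -10.0, -5.0, 41.0]
N = length(x)
ϵ = 1e-6 * N                      # same value as the notebook's ϵ

μ(v) = sum(v) / length(v)

# Reference scale: 1 / sqrt(Var[x] + ϵ)
ref_scale = (μ((x .- μ(x)) .^ 2) + ϵ)^(-0.5)

# Squared norm of the centred vector, |x|² - (x⋅1)² / N
c2 = sum(abs2, x) - sum(x)^2 / N

# Scale implied by sqrt(N) * u_{Nϵ}: sqrt(N) / sqrt(|c|² + N ϵ)
ga_scale = sqrt(N) / sqrt(c2 + N * ϵ)

(ref_scale, ga_scale)
```

The two expressions are algebraically identical, so they should agree up to floating-point rounding, which gives a target value for `scale(ga_u)`.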