# GPT-2 
---


In [1]:
import tensorflow as tf

In [2]:
tf.enable_eager_execution()

## Normalization

A warm-up first: the elementwise operations that `norm` builds on, and why it needs an `epsilon`:

In [3]:
x = tf.constant([[1.,2.,3.],[4.,5.,6.]])
print('orig', x.numpy(), sep='\n', end='\n----------\n')
y = x - 2.
print('orig -2', y.numpy(), sep='\n', end='\n----------\n')
ysq = tf.square(y)
print('squared', ysq.numpy(), sep='\n', end='\n----------\n')
print('sqr = x^1/2', tf.sqrt(ysq).numpy(), sep='\n', end='\n----------\n')
print('rsqr: 1/(x^1/2), the inf at [0,1] being the reason why they use epsilon', tf.rsqrt(ysq).numpy(), sep='\n', end='\n----------\n') 
print('rsqr: 1/(x^1/2), with epsilon', tf.rsqrt(ysq + 1e-5).numpy(), sep='\n', end='\n----------\n') 

orig
[[1. 2. 3.]
 [4. 5. 6.]]
----------
orig -2
[[-1.  0.  1.]
 [ 2.  3.  4.]]
----------
squared
[[ 1.  0.  1.]
 [ 4.  9. 16.]]
----------
sqr = x^1/2
[[1. 0. 1.]
 [2. 3. 4.]]
----------
rsqr: 1/(x^1/2), the inf at [0,1] being the reason why they use epsilon
[[1.                inf 1.        ]
 [0.5        0.33333334 0.25      ]]
----------
rsqr: 1/(x^1/2), with epsilon
[[9.9999499e-01 3.1622778e+02 9.9999499e-01]
 [4.9999940e-01 3.3333313e-01 2.4999994e-01]]
----------


### The function

Deets:
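- `axis=-1`: `reduce_mean` is applied over the innermost (last) dimension.  
- `shape[-1]`: likewise, the size of the innermost dimension.  
- `tf.rsqrt(v)` computes `1/sqrt(v)` directly, so the code multiplies by it instead of dividing by `tf.sqrt(v)`.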
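
To make the `axis=-1` point concrete, here is a small sketch in NumPy (used only for illustration, the notebook itself uses TensorFlow). It also shows what `keepdims=True`, which `norm` passes to `reduce_mean`, does to the shape: the reduced axis is kept with size 1, so the result broadcasts against the input.

```python
import numpy as np

x = np.array([[1., 2., 3.],
              [4., 5., 6.]])

# axis=-1 reduces over the innermost dimension: one mean per row.
row_mean = x.mean(axis=-1)                      # shape (2,)

# keepdims=True keeps the reduced axis with size 1, so the result
# can be subtracted from x without any reshaping.
row_mean_kept = x.mean(axis=-1, keepdims=True)  # shape (2, 1)

centered = x - row_mean_kept                    # shape (2, 3)

print(row_mean.shape, row_mean_kept.shape)
print(centered)
```

Each row is centered around its own mean, independently of the other row.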

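The equations have the same form as the batch norm equations, from the [original paper](https://arxiv.org/pdf/1502.03167v3.pdf), found [here](https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c). Note that `norm` below takes the mean and variance over the innermost (feature) axis of each example rather than over the batch, which makes it layer normalization:  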
![norm](batch_norm.png "Batch Normalization Equations")
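
In case the image does not render, the same steps can be written out as below. This is a transcription into the names used by the `norm` function further down (`u`, `s`, `g`, `b`, `epsilon`), not the paper's notation ($\mu$, $\sigma^2$, $\gamma$, $\beta$); $n$ is assumed to be the size of the axis being reduced.

$$
u = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s = \frac{1}{n}\sum_{i=1}^{n} (x_i - u)^2
\qquad
\hat{x}_i = \frac{x_i - u}{\sqrt{s + \epsilon}}
\qquad
y_i = g_i\,\hat{x}_i + b_i
$$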

See also these videos by Andrew Ng:
- [Normalizing inputs](https://www.youtube.com/watch?v=FDCfw-YqWTE&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=9)
- [Batch Normalization](https://www.youtube.com/watch?v=tNIpEZLv_eg&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=27).


N.B.: a bare `*` in a function's parameter list means that every parameter after it is keyword-only, i.e. it cannot be passed positionally, cf. [here](https://stackoverflow.com/a/53797072) and [there](https://www.python.org/dev/peps/pep-3102/).
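
A minimal illustration of the keyword-only behaviour, using a made-up function `scale` (hypothetical, not part of GPT-2):

```python
def scale(x, *, factor=2.0):
    """Multiply x by factor; factor can only be passed by keyword."""
    return x * factor

print(scale(3.0))                # uses the default factor
print(scale(3.0, factor=10.0))   # keyword form is accepted

# Passing factor positionally raises a TypeError.
rejected = False
try:
    scale(3.0, 10.0)
except TypeError as err:
    rejected = True
    print('TypeError:', err)
```

In `norm`, this means `axis` and `epsilon` have to be spelled out by name at the call site.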

In [4]:
def norm(x, scope, *, axis=-1, epsilon=1e-5):
    """Normalize to mean = 0, std = 1, then do a diagonal affine transform."""
    with tf.variable_scope(scope):
        # take the innermost dimension
        n_state = x.shape[-1].value
        
        # weight & bias that will be trained
        g = tf.get_variable('g', 
                            [n_state], 
                            initializer=tf.constant_initializer(1))
        b = tf.get_variable('b', 
                            [n_state], 
                            initializer=tf.constant_initializer(0))
        
        # take the mean over the given axis
        u = tf.reduce_mean(x, axis=axis, keepdims=True)
        # take the variance
        s = tf.reduce_mean(tf.square(x-u), axis=axis, keepdims=True)
        # normalization
        x = (x - u) * tf.rsqrt(s + epsilon)
        # scaling & shifting using weight & bias
        x = x*g + b
        return x
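
As a sanity check, the same arithmetic can be sketched in plain NumPy. `norm_np` is a hypothetical helper written for this note, not part of the GPT-2 code; `g` and `b` are passed in explicitly with the values the initializers above would give them (ones and zeros) instead of being created as TensorFlow variables.

```python
import numpy as np

def norm_np(x, g, b, axis=-1, epsilon=1e-5):
    """NumPy sketch of the computation in `norm` (illustration only)."""
    u = np.mean(x, axis=axis, keepdims=True)                 # mean
    s = np.mean(np.square(x - u), axis=axis, keepdims=True)  # variance
    x_hat = (x - u) / np.sqrt(s + epsilon)                   # normalize
    return x_hat * g + b                                     # scale & shift

x = np.array([[1., 2., 3.],
              [4., 5., 6.]])
g = np.ones(3)    # matches constant_initializer(1)
b = np.zeros(3)   # matches constant_initializer(0)

y = norm_np(x, g, b)
print(y)
print('mean per row:', y.mean(axis=-1))
print('std per row: ', y.std(axis=-1))
```

Both rows come out identical, since they differ only by a constant offset and each row is normalized on its own. The per-row mean is approximately 0 and the standard deviation is just below 1 because of `epsilon`.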