### About Xavier initialization
The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network.

If either occurs, loss gradients will either be too large or too small to flow backwards beneficially, and the network will take longer to converge, if it is even able to do so at all.
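To make the exploding/vanishing behaviour concrete, here is a minimal sketch (not part of the original experiment) that pushes random inputs through a stack of purely linear layers and measures the spread of the final output. The helper name `final_activation_std`, the layer width of 256, and the depth of 10 are illustrative assumptions.

```python
import numpy as np

def final_activation_std(weight_std, n_units=256, n_layers=10, seed=0):
    """Push random inputs through a stack of linear layers and
    return the standard deviation of the last layer's output."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_units, 64))  # 64 random input vectors
    for _ in range(n_layers):
        # weights drawn from a normal distribution with the given scale
        W = rng.standard_normal((n_units, n_units)) * weight_std
        a = W @ a
    return a.std()

exploding = final_activation_std(weight_std=1.0)                # scale too large
vanishing = final_activation_std(weight_std=0.01)               # scale too small
balanced = final_activation_std(weight_std=1.0 / np.sqrt(256))  # 1/sqrt(fan-in)

print("std with weight scale 1.0:       ", exploding)
print("std with weight scale 0.01:      ", vanishing)
print("std with weight scale 1/sqrt(n): ", balanced)
```

With unit-scale weights the output spread grows by roughly a factor of sqrt(256) = 16 per layer, with the tiny scale it shrinks at every layer, and with the fan-in scaling it stays on the order of 1.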

#### Definition

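Xavier initialization sets a layer’s weights to values drawn from a uniform distribution bounded between:
    ± sqrt(6) / sqrt(n_i + n_o)

where n_i is the number of incoming connections to the layer (fan-in) and n_o is the number of outgoing connections (fan-out).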
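As a minimal sketch, the uniform variant can be written in NumPy by sampling from the interval [-limit, limit] with limit = sqrt(6) / sqrt(n_i + n_o). The function name `xavier_uniform` and the (n_out, n_in) weight layout are assumptions chosen to match the (H, S) shapes used later in this notebook.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Sample an (n_out, n_in) weight matrix from U(-limit, limit),
    with limit = sqrt(6) / sqrt(n_in + n_out)."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

# example: 4 inputs (the CartPole state size) feeding 8 hidden units
W = xavier_uniform(n_in=4, n_out=8, rng=np.random.default_rng(0))
print(W.shape)
print(np.abs(W).max() <= np.sqrt(6.0) / np.sqrt(4 + 8))
```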
    
#### Even better?

In their 2015 paper, He et al. demonstrated that deep networks (e.g. a 22-layer CNN) converge much earlier if the following weight initialization strategy is employed:

  - Create a tensor with the dimensions appropriate for a weight matrix at a given layer, and populate it with numbers randomly chosen from a standard normal distribution.
  - Multiply each randomly chosen number by √2/√n, where n is the number of incoming connections to a given layer from the previous layer’s output (also known as the “fan-in”).
  - Bias tensors are initialized to zero.
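The three steps above can be sketched in NumPy as follows. This is an illustrative version, not code from the paper: the function name `he_normal` and the (fan_out, fan_in) layout are assumptions.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng=None):
    """Standard normal samples scaled by sqrt(2 / fan_in), plus zero biases."""
    rng = np.random.default_rng() if rng is None else rng
    # steps 1 and 2: standard normal tensor, scaled by sqrt(2)/sqrt(fan_in)
    W = rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)
    # step 3: biases start at zero
    b = np.zeros(fan_out)
    return W, b

W, b = he_normal(fan_in=512, fan_out=256, rng=np.random.default_rng(0))
print("empirical std:", W.std())
print("target std:   ", np.sqrt(2.0 / 512))
```

The empirical standard deviation of the sampled weights should land close to the target sqrt(2/n).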

# 1. Define environment and model

In [1]:
import random
import gym
import numpy as np


# create cart pole environment
ENV_NAME = "CartPole-v1"
env = gym.make(ENV_NAME)

# get main parameters from the agent-environment interaction

number_of_actions = env.action_space.n
state_dim = env.observation_space.shape[0]



In [2]:
# Model parameters
H = 8 
A = number_of_actions
S = state_dim

# Model weights
model = {}
model['W1'] = np.random.randn(H,S) / np.sqrt(S) # "Xavier"-style initialization: standard normal scaled by 1/sqrt(fan-in)
model['W2'] = np.random.randn(A,H) / np.sqrt(H)

# 2. Define useful functions for training and evaluating the network

In [3]:
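def softmax(y):
    # warn when the logits are getting very large
    if max(y) > 10:
        print("WARNING: VERY LARGE NUMBERS")
        print(y)
    # subtract the max logit for numerical stability (the result is unchanged)
    e = np.exp(y - np.max(y))
    return e / np.sum(e)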

def propagate_forward(x):
    global model
    h = np.dot(model['W1'], x)
    h[h<0] = 0
    y = np.dot(model['W2'], h)
    p = softmax(y)

    return h,y,p

def policy_gradient(action, reward, y, p):
    grad = -p
    grad[action] += 1.
    return grad*reward
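The `policy_gradient` function relies on the fact that the gradient of log softmax(y)[action] with respect to the logits y is one_hot(action) - p. A quick finite-difference check of that identity might look like the sketch below; the sample logits and the helper `log_prob` are illustrative assumptions, and `softmax` is redefined so the block runs on its own.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def log_prob(y, action):
    """Log-probability of `action` under the softmax policy."""
    return np.log(softmax(y)[action])

y = np.array([0.3, -1.2])  # arbitrary sample logits
action = 0

# analytic gradient: one_hot(action) - p
analytic = -softmax(y)
analytic[action] += 1.0

# numeric gradient via central differences
eps = 1e-6
numeric = np.zeros_like(y)
for i in range(len(y)):
    step = np.zeros_like(y)
    step[i] = eps
    numeric[i] = (log_prob(y + step, action) - log_prob(y - step, action)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```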


In [4]:
# sample input
x = np.array([-0.18903734, -1.75218701,  0.21354577,  2.74911676])

h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)

## 2.1 Backprop

In [5]:
# with dense layers, each weight gradient is an outer product
# of the upstream gradient and the layer's input

# W2 update
dw2 = np.outer(dy, h)
## if we ONLY apply W2 += dw2 (a step size of 1), forward propagating
## the same input x again yields a nearly 'perfect' policy for this example (see the tests below)
## W2 ready :)



# W1 update
dh = np.dot(dy, model['W2'])
dh[h==0] = 0  # ReLU passes no gradient through inactive units
## dh is the direction in which to move the hidden state: h = h + lr*dh
dw1 = np.outer(dh, x)
## W1 ready :)


#### Tests: 
#### 1. After only updating W2 and evaluating policy gradient again


In [6]:
print(model['W2'])

[[ 0.27323657  0.15243403 -0.36531208  0.04196286  0.37061983  0.22093565
  -0.22469356  0.17293585]
 [ 0.16214547 -0.49864699  0.71875238  0.72459368  0.05994918  0.1239246
   0.065119    0.18102478]]


In [7]:
h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)
print(dy)

[ 0.39907043 -0.39907043]


In [8]:
# W2 update
dw2 = np.outer(dy, h)
model['W2'] += dw2
h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)
print(dy)

[ 8.31339330e-08 -8.31339329e-08]


##### The policy gradient is approximately 0 :)


##### Of course, this update is far too aggressive: with a step size of 1 it forces the network to fit the chosen action for the current example almost perfectly
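One way to see why a single full-size W2 step has such a strong effect: adding np.outer(dy, h) to W2 shifts the logits for the same input by dy * (h · h), so a hidden vector with a large norm moves the logits a long way. The sketch below checks that identity on made-up values; `h` and `W2` here are random stand-ins, not the notebook's actual weights.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.abs(rng.standard_normal(8))             # stand-in for a post-ReLU hidden vector
W2 = rng.standard_normal((2, 8)) / np.sqrt(8)  # stand-in output weights

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

y_old = W2 @ h
p_old = softmax(y_old)

# gradient at the logits for action 0 with reward 1
dy = -p_old
dy[0] += 1.0

# full-size update of W2, then the same forward pass again
y_new = (W2 + np.outer(dy, h)) @ h
p_new = softmax(y_new)

# the logits moved by exactly dy * ||h||^2
print(np.allclose(y_new, y_old + dy * (h @ h)))
print("p(action 0) before:", p_old[0], "after:", p_new[0])
```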

#### 2. After only updating W1 and evaluating policy gradient again

In [9]:
model = {}
model['W1'] = np.random.randn(H,S) / np.sqrt(S) # "Xavier"-style initialization: standard normal scaled by 1/sqrt(fan-in)
model['W2'] = np.random.randn(A,H) / np.sqrt(H)

h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)

print(dy)

[ 0.38516315 -0.38516315]


In [10]:
# W1 update
dh = np.dot(dy, model['W2'])
dh[h==0] = 0
## now we have how much we need to increase the hidden state: h = h+lr*dh
dw1 = np.outer(dh, x)
## W1 ready :)

model['W1'] += dw1

In [11]:
h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)
print(dy)

[ 0.09442277 -0.09442277]


#### The policy gradient is not 0 because the ReLU keeps some of the necessary neurons from activating


### Anyway, the final backprop code


In [12]:
def backprop(dy, h, x):
    """Return the gradients (dw1, dw2) for both weight matrices, given the gradient dy at the logits."""
    global model
    # with dense layers, each weight gradient is an outer product
    # of the upstream gradient and the layer's input

    # W2 gradient
    dw2 = np.outer(dy, h)

    # W1 gradient: push dy back through W2, then through the ReLU
    dh = np.dot(dy, model['W2'])
    dh[h<=0] = 0  # inactive ReLU units pass no gradient
    dw1 = np.outer(dh, x)
    return dw1, dw2
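A gradient check is a cheap way to gain confidence in hand-written backprop. The sketch below rebuilds the same two-layer ReLU network with local variables (rather than the global `model`) and compares the analytic gradients of log p(action) against central finite differences. The helper names `log_prob` and `numeric_grad` and the random test input are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, H, A = 4, 8, 2
W1 = rng.standard_normal((H, S)) / np.sqrt(S)
W2 = rng.standard_normal((A, H)) / np.sqrt(H)
x = rng.standard_normal(S)
action = 0

def log_prob(W1, W2):
    """log softmax(W2 @ relu(W1 @ x))[action]"""
    h = np.maximum(W1 @ x, 0.0)
    y = W2 @ h
    y = y - y.max()
    return y[action] - np.log(np.exp(y).sum())

# analytic gradients, following the same steps as backprop above
h = np.maximum(W1 @ x, 0.0)
y = W2 @ h
p = np.exp(y - y.max())
p /= p.sum()
dy = -p
dy[action] += 1.0
dw2 = np.outer(dy, h)
dh = dy @ W2
dh[h <= 0] = 0
dw1 = np.outer(dh, x)

def numeric_grad(W, f, eps=1e-6):
    """Central-difference gradient of f() with respect to each entry of W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old  # restore the original weight
        g[idx] = (up - down) / (2 * eps)
    return g

num_dw1 = numeric_grad(W1, lambda: log_prob(W1, W2))
num_dw2 = numeric_grad(W2, lambda: log_prob(W1, W2))

print(np.allclose(dw1, num_dw1, atol=1e-5), np.allclose(dw2, num_dw2, atol=1e-5))
```

The check can fail spuriously if a hidden pre-activation sits within `eps` of zero, because the ReLU is not differentiable there.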

## 2.2 Rewards
As in Karpathy's "Pong from Pixels", we normalize the rewards obtained in each batch.
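As a minimal sketch of the discount-then-normalize idea: compute the discounted return at every step of an episode, then standardize the values to zero mean and unit variance. The function names `discounted_returns` and `normalize` are hypothetical, and the small epsilon in the denominator is an added safeguard against division by zero that the notebook's own code does not use.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def normalize(values):
    """Standardize to zero mean and unit variance (epsilon is an added assumption)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.std() + 1e-8)

episode_rewards = [1.0, 1.0, 1.0]  # a reward of 1 per step, as in CartPole
returns = discounted_returns(episode_rewards, gamma=0.97)
print(returns)             # [2.9109 1.97   1.    ]
print(normalize(returns))
```

Early steps get the largest returns because more reward follows them, and after normalization roughly half of the steps in a batch end up with a negative weight.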

In [13]:
gamma = 0.97
# gamma_series[k] is the discounted return of k+1 consecutive rewards of 1
gamma_series = [1]
for e in range(1000):
    gamma_series.append(gamma_series[-1]*gamma + 1)


model = {}
# scaled normal initialization (a smaller scale than the 1/sqrt(fan-in) used above)
model['W1'] = np.random.randn(H,S) / np.sqrt(S*H)
model['W2'] = np.random.randn(A,H) / np.sqrt(H*A)

learning_rate = 0.0001
count_episodes = 0

gradient_buffer = []


for i in range(1000):
    # Play a batch of 100 episodes
    rewards = []
    states = []
    actions = []
    position_rewards= []


    for e in range(100):
        terminal = False
        state = env.reset()
        r = 0

        while not terminal:
            if  count_episodes %300==0:
                #env.render()
                pass


            h,y,p = propagate_forward(state)

            states.append(state)



            # select action for current state
            action = np.random.choice(A, p=p)
            actions.append(action)

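            # classic gym API with 4 return values; gym >= 0.26 and gymnasium return
            # (obs, reward, terminated, truncated, info) here, and reset() returns (obs, info)
            state, reward, terminal, info = env.step(action)
            # we want to keep the cart close to the center of the track (state[0] is the cart position)
            position_rewards.append(-np.abs(state[0])*2)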

            r+= 1


        count_episodes+=1

        if  count_episodes %10000==0:
            print(model['W1'])
            print(model['W2'])

        rewards.append(r)

    print('Average episode lasted: ',np.mean(rewards))
    print('Number of episodes played: ',count_episodes)
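    # discounted return at every step of each episode (every step earns a reward of 1)
    rs = [list(reversed(gamma_series[:r])) for r in rewards]
    mean_r = np.mean([np.mean(r) for r in rs])
    # add the position penalty, then normalize over the whole batch
    rs = np.concatenate(rs) + np.array(position_rewards)
    normalized_rewards = (rs - np.mean(rs))/np.std(rs)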


    # train

    ind = np.arange(len(rs))
    np.random.shuffle(ind)
    for t in ind:
        reward = normalized_rewards[t]
        state = states[t]
        action = actions[t]

        h,y,p = propagate_forward(state)
        dy = policy_gradient(action,reward, y, p)
        dw1, dw2 = backprop(dy, h, state)

        # gradient ascent step, with each element of the update clipped
        w1_update = np.clip(learning_rate * dw1, -0.0005, +0.0005)
        w2_update = np.clip(learning_rate * dw2, -0.0005, +0.0005)

        model['W1'] += w1_update
        model['W2'] += w2_update

    #gradient_buffer.append([np.linalg.norm(learning_rate * dw1), np.linalg.norm(learning_rate * dw2)])
    #gradient_buffer.append([np.max(np.abs(learning_rate * dw1)), np.max(np.abs(learning_rate * dw2))])
    #print(np.linalg.norm(dw1))
    #print(np.linalg.norm(dw2))


Average episode lasted:  22.8
Number of episodes played:  100
Average episode lasted:  22.79
Number of episodes played:  200
Average episode lasted:  20.93
Number of episodes played:  300
Average episode lasted:  21.82
Number of episodes played:  400
Average episode lasted:  19.57
Number of episodes played:  500
Average episode lasted:  20.67
Number of episodes played:  600
Average episode lasted:  23.45
Number of episodes played:  700
Average episode lasted:  20.91
Number of episodes played:  800
Average episode lasted:  21.9
Number of episodes played:  900
Average episode lasted:  22.17
Number of episodes played:  1000
Average episode lasted:  21.77
Number of episodes played:  1100
Average episode lasted:  24.32
Number of episodes played:  1200
Average episode lasted:  21.06
Number of episodes played:  1300
Average episode lasted:  20.48
Number of episodes played:  1400
Average episode lasted:  22.24
Number of episodes played:  1500
Average episode lasted:  22.35
Number of episodes p

KeyboardInterrupt: 