### About Xavier initialization
The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network.

If either occurs, loss gradients will either be too large or too small to flow backwards beneficially, and the network will take longer to converge, if it is even able to do so at all.

#### Definition

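Xavier (Glorot) initialization sets a layer’s weights to values drawn from a uniform distribution bounded between:
    -sqrt(6) / sqrt(n_i + n_o)  and  +sqrt(6) / sqrt(n_i + n_o)

where n_i is the number of incoming connections to the layer (“fan-in”) and n_o is the number of outgoing connections (“fan-out”).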
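
A minimal NumPy sketch of this rule, assuming a dense layer whose weight matrix has shape (n_o, n_i) and weights drawn from U(-limit, +limit) with limit = sqrt(6) / sqrt(n_i + n_o). The function name `xavier_uniform` and the layer sizes are illustrative and are not used elsewhere in this notebook:

```python
import numpy as np

def xavier_uniform(n_i, n_o, rng=None):
    """Draw an (n_o, n_i) weight matrix from U(-limit, +limit)."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0) / np.sqrt(n_i + n_o)
    return rng.uniform(-limit, limit, size=(n_o, n_i))

# illustrative sizes: 4 inputs, 8 hidden units
W = xavier_uniform(n_i=4, n_o=8, rng=np.random.default_rng(0))
print(W.shape)
print(np.abs(W).max(), "<=", np.sqrt(6.0) / np.sqrt(4 + 8))
```

A uniform distribution on (-a, +a) has variance a²/3, so this bound gives each weight a variance of 2 / (n_i + n_o).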
    
#### Even better?

In their 2015 paper, He et al. demonstrated that deep networks (e.g. a 22-layer CNN) converge much earlier if the following weight initialization strategy is employed:

  - Create a tensor with the dimensions appropriate for a weight matrix at a given layer, and populate it with numbers randomly chosen from a standard normal distribution.
  - Multiply each randomly chosen number by √2/√n, where n is the number of connections coming into a given layer from the previous layer’s output (also known as the “fan-in”).
  - Bias tensors are initialized to zero.
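
The three steps above can be sketched in NumPy as follows. This is an illustration under assumptions, not code from the paper: the function name `he_normal`, the (fan_out, fan_in) weight layout, and the layer sizes are all made up for the example.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng=None):
    """Return (W, b) for a dense layer following the steps listed above."""
    rng = np.random.default_rng() if rng is None else rng
    # step 1: standard normal samples in the shape of the weight matrix
    W = rng.standard_normal((fan_out, fan_in))
    # step 2: scale every sample by sqrt(2) / sqrt(fan_in)
    W *= np.sqrt(2.0) / np.sqrt(fan_in)
    # step 3: biases start at zero
    b = np.zeros(fan_out)
    return W, b

W_he, b_he = he_normal(fan_in=256, fan_out=128, rng=np.random.default_rng(0))
# the empirical standard deviation should be close to sqrt(2 / fan_in)
print(W_he.std(), np.sqrt(2.0 / 256))
```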

# 1. Define environment and model

In [1]:
import random
import gym
import numpy as np
from collections import deque

# create cart pole environment
ENV_NAME = "CartPole-v1"
env = gym.make(ENV_NAME)

# get main parameters from the agent-environment interaction

number_of_actions = env.action_space.n
state_dim = env.observation_space.shape[0]



In [2]:
# Model parameters
H = 8 
A = number_of_actions
S = state_dim

# Model weights
model = {}
model['W1'] = np.random.randn(H,S) / np.sqrt(S) # "Xavier"-style initialization: standard normal scaled by 1/sqrt(fan-in)
model['W2'] = np.random.randn(A,H) / np.sqrt(H)

# 2. Define useful functions for training and evaluating the network

In [12]:
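def softmax(y):
    # subtracting the max leaves the result unchanged but avoids overflow in exp
    e = np.exp(y - np.max(y))
    return e/np.sum(e)

def propagate_forward(x):
    global model
    h = np.dot(model['W1'], x)  # hidden pre-activations
    h[h<0] = 0                  # ReLU
    y = np.dot(model['W2'], h)  # action logits
    p = softmax(y)              # action probabilities

    return h,y,p

def policy_gradient(action, reward, y, p):
    # gradient of log p[action] w.r.t. the logits y is one_hot(action) - p
    grad = -p
    grad[action] += 1.
    # scale by the reward so that good actions are reinforced
    return grad*reward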
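
`policy_gradient` relies on the gradient of log p[action] with respect to the logits y being the one-hot vector of the action minus p. A quick finite-difference check of that identity, as a sketch: the names `stable_softmax`, `log_prob`, `numeric_grad` and the sample logits are illustrative assumptions, and the reward is fixed to 1.

```python
import numpy as np

def stable_softmax(y):
    e = np.exp(y - np.max(y))
    return e / np.sum(e)

def log_prob(y, action):
    """Log-probability of `action` under softmax(y)."""
    return np.log(stable_softmax(y)[action])

def numeric_grad(y, action, eps=1e-6):
    """Central finite differences of log_prob with respect to each logit."""
    grad = np.zeros_like(y)
    for j in range(len(y)):
        step = np.zeros_like(y)
        step[j] = eps
        grad[j] = (log_prob(y + step, action) - log_prob(y - step, action)) / (2 * eps)
    return grad

y_demo = np.array([0.3, -1.2])   # made-up logits
action_demo = 0

# analytic gradient, built the same way as in policy_gradient (reward = 1)
analytic = -stable_softmax(y_demo)
analytic[action_demo] += 1.0

numeric = numeric_grad(y_demo, action_demo)
print(np.allclose(analytic, numeric, atol=1e-6))
```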


In [13]:
# sample input
x = np.array([-0.18903734, -1.75218701,  0.21354577,  2.74911676])

h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)

## 2.1 Backprop

In [14]:
# with dense layers, each weight gradient is simply the outer product of the
# upstream gradient and the input of that layer

# W2 update
dw2 = np.outer(dy, h)
## if we ONLY apply W2 += dw2 (step size 1), forward propagating the same input x
## puts almost all probability on the chosen action for this example (checked below)
## W2 ready :)



# W1 update
dh = np.dot(dy, model['W2'])
dh[h==0] = 0   # no gradient flows through ReLU units that are switched off
## dh is the direction in which the hidden state should move: h = h + lr*dh
dw1 = np.outer(dh, x)
## W1 ready :)


#### BACKPROP Tests: 
#### 1. After only updating second layer weights and evaluating policy gradient again


In [15]:
print(model['W2'])
h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)
print(dy)

# W2 update
dw2 = np.outer(dy, h)
model['W2'] += dw2

h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)
print(dy)

[[-1.50281072 -2.51491859  1.45652641 -0.58305546  0.67784589 -0.55232407
  -0.06772914 -1.20852512]
 [ 1.6629324   2.69174192 -1.36031205  0.92222555 -1.04027333  0.42643155
  -0.45370185  1.2694156 ]]
[ 1. -1.]
[ 0.00000000e+00 -5.39829682e-55]


##### The policy gradient is (numerically) 0 :)


##### Of course, this update is far too aggressive: it forces the policy to fit the chosen action for the current example perfectly

#### 2. After only updating W1 and evaluating policy gradient again

In [16]:
model = {}
model['W1'] = np.random.randn(H,S) / np.sqrt(S) # "Xavier"-style initialization: standard normal scaled by 1/sqrt(fan-in)
model['W2'] = np.random.randn(A,H) / np.sqrt(H)

h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)

print(dy)

# W1 update
dh = np.dot(dy, model['W2'])
dh[h==0] = 0
## dh is the direction in which the hidden state should move: h = h + lr*dh
dw1 = np.outer(dh, x)
## W1 ready :)

model['W1'] += dw1

h,y,p = propagate_forward(x)
dy = policy_gradient(0,1, y, p)
print(dy)

[ 0.11081384 -0.11081384]
[ 0.02402039 -0.02402039]


#### The policy gradient is not 0 because the ReLU keeps some of the necessary neurons from activating


### Anyway, the final backprop code


In [17]:
def backprop(dy, h, x):
    global model
    # with dense layers, each weight gradient is simply the outer product of the
    # upstream gradient and the input of that layer

    # W2 update
    dw2 = np.outer(dy, h)
    ## if we ONLY apply W2 += dw2 (step size 1), forward propagating the same input x
    ## puts almost all probability on the chosen action for this example (checked above)
    ## W2 ready :)

    # W1 update
    dh = np.dot(dy, model['W2'])
    dh[h<=0] = 0   # no gradient flows through ReLU units that are switched off
    ## dh is the direction in which the hidden state should move: h = h + lr*dh
    dw1 = np.outer(dh, x)
    ## W1 ready :)
    return dw1, dw2
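
One way to gain confidence in these formulas is a finite-difference gradient check. The sketch below rebuilds a small network with the same structure (ReLU hidden layer, softmax output) using local weights, so it does not touch `model`. All names ending in `_demo`, the helper `numeric_grad`, and the random inputs are assumptions made for the example. It treats `dw1` and `dw2` as gradients of reward * log p[action], which is why the training loop adds them to the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
H_demo, S_demo, A_demo = 8, 4, 2
W1 = rng.standard_normal((H_demo, S_demo)) / np.sqrt(S_demo)
W2 = rng.standard_normal((A_demo, H_demo)) / np.sqrt(H_demo)
x_demo = rng.standard_normal(S_demo)
action_demo, reward_demo = 0, 1.0

def forward(W1, W2):
    """Return (objective, h, p) where objective = reward * log p[action]."""
    h = np.maximum(np.dot(W1, x_demo), 0)      # ReLU hidden layer
    y = np.dot(W2, h)
    e = np.exp(y - np.max(y))
    p = e / np.sum(e)
    return reward_demo * np.log(p[action_demo]), h, p

# analytic gradients, same formulas as policy_gradient + backprop
_, h, p = forward(W1, W2)
dy = -p
dy[action_demo] += 1.0
dy *= reward_demo
dw2 = np.outer(dy, h)
dh = np.dot(dy, W2)
dh[h <= 0] = 0
dw1 = np.outer(dh, x_demo)

def numeric_grad(W, eps=1e-6):
    """Central differences of the objective w.r.t. W (perturbed in place, then restored)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        original = W[idx]
        W[idx] = original + eps
        f_plus = forward(W1, W2)[0]
        W[idx] = original - eps
        f_minus = forward(W1, W2)[0]
        W[idx] = original
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

num_dw1 = numeric_grad(W1)
num_dw2 = numeric_grad(W2)
print(np.allclose(dw1, num_dw1, atol=1e-5), np.allclose(dw2, num_dw2, atol=1e-5))
```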

## 2.2 Rewards
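As in Karpathy's Pong from Pixels, we normalize the discounted rewards obtained in each batch. CartPole gives a reward of 1 at every step, so the discounted return at a given step depends only on how many steps remain in the episode, and these values can be precomputed once.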

In [18]:
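gamma = 0.95
# gamma_series[k] = 1 + gamma + ... + gamma**k,
# i.e. the discounted return when k+1 rewards of 1 are still to come
gamma_series = [1]
for _ in range(1000):
    gamma_series.append(gamma_series[-1]*gamma + 1)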
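
To see what this series holds, the sketch below rebuilds it so the block runs on its own, then compares it with the closed form of the geometric sum and with a generic backward recursion for discounted returns. The helpers `closed_form` and `discounted_returns` and the episode length of 5 are illustrative assumptions.

```python
gamma = 0.95

gamma_series = [1]
for _ in range(1000):
    gamma_series.append(gamma_series[-1] * gamma + 1)

def closed_form(k, gamma):
    """1 + gamma + ... + gamma**k as a geometric sum."""
    return (1 - gamma ** (k + 1)) / (1 - gamma)

def discounted_returns(rewards, gamma):
    """Generic backward recursion G_t = r_t + gamma * G_{t+1}."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

episode_length = 5
# same slicing as in the training loop: the first step gets the largest return
lookup = list(reversed(gamma_series[:episode_length]))
direct = discounted_returns([1.0] * episode_length, gamma)
print(lookup)
print(direct)
```

The lookup only works because every step earns the same reward; with varying rewards the backward recursion would be needed for each episode.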

# 3. Train the model!

In [24]:
model = {}
model['W1'] = np.random.randn(H,S) / np.sqrt(S*H) # scaled-down random initialization (smaller than the "Xavier"-style scaling used earlier)
model['W2'] = np.random.randn(A,H) / np.sqrt(H*A)

learning_rate = 0.001

for i in range(10):
    # Play a bunch of episodes
    rewards = []
    states = []
    actions = []

    for e in range(500):
        terminal = False
        # classic gym API (gym < 0.26): reset() returns only the observation
        state = env.reset()
        r = 0

        while not terminal:

            h,y,p = propagate_forward(state)

            states.append(state)

            # select action for current state
            action = np.random.choice(number_of_actions, p=p)
            actions.append(action)

            # classic gym API: step() returns 4 values
            state, reward, terminal, info = env.step(action)

            # CartPole gives a reward of 1 per step, so r counts the episode length
            r += 1

        rewards.append(r)
    print('Average episode lasted ',np.mean(rewards))

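    # per-step discounted returns, looked up in gamma_series (reward is 1 at every step)
    rs = [list(reversed(gamma_series[:r])) for r in rewards]
    mean_r = np.mean([np.mean(r) for r in rs])
    rs = np.concatenate(rs)
    # standardize the returns over the whole batch
    normalized_rewards = (rs - np.mean(rs))/np.std(rs)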


    # train: one policy-gradient step per recorded transition, in random order

    ind = np.arange(len(rs))
    np.random.shuffle(ind)
    for t in ind:
        reward = normalized_rewards[t]
        state = states[t]
        action = actions[t]

        h,y,p = propagate_forward(state)
        dy = policy_gradient(action,reward, y, p)
        dw1, dw2 = backprop(dy, h, state)

        # update weights (gradient ascent), clipping each element of the step
        model['W1'] += np.clip(learning_rate * dw1, -0.0005, +0.0005)
        model['W2'] += np.clip(learning_rate * dw2, -0.0005, +0.0005)




Average episode lasted  21.006
Average episode lasted  26.03
Average episode lasted  36.168
Average episode lasted  52.74
Average episode lasted  84.686
Average episode lasted  140.126
Average episode lasted  95.25
Average episode lasted  196.574
Average episode lasted  352.36
Average episode lasted  252.646


In [25]:
# save model
import pickle
with open('model.pickle', 'wb') as f:
    pickle.dump(model, f)
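
Loading the weights back is the mirror image of saving them. The sketch below round-trips a stand-in dictionary through a temporary file so that it does not overwrite `model.pickle`. Both `demo_model` and the temporary path are illustrative assumptions, and how `see_the_model_play.py` actually loads the file is not shown in this notebook.

```python
import os
import pickle
import tempfile

import numpy as np

# stand-in for the trained weights, same keys and shapes as `model`
demo_model = {'W1': np.ones((8, 4)), 'W2': np.zeros((2, 8))}

path = os.path.join(tempfile.mkdtemp(), 'model.pickle')
with open(path, 'wb') as f:
    pickle.dump(demo_model, f)

# read the dictionary back; only unpickle files you trust
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(sorted(loaded))
print(np.array_equal(loaded['W1'], demo_model['W1']))
```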


# See the model play

In [29]:
# see it play
!python see_the_model_play.py