# Actor Critic Model

## Initial setup

In [3]:
import sys
sys.path.insert(0, "../python")
print(sys.path)

['../python', '', '/home/james/anaconda3/envs/vizdoom/lib/python36.zip', '/home/james/anaconda3/envs/vizdoom/lib/python3.6', '/home/james/anaconda3/envs/vizdoom/lib/python3.6/lib-dynload', '/home/james/.local/lib/python3.6/site-packages', '/home/james/anaconda3/envs/vizdoom/lib/python3.6/site-packages', '/home/james/anaconda3/envs/vizdoom/lib/python3.6/site-packages/Sphinx-1.5.4-py3.6.egg', '/home/james/anaconda3/envs/vizdoom/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg', '/home/james/anaconda3/envs/vizdoom/lib/python3.6/site-packages/IPython/extensions', '/home/james/.ipython']


In [4]:
from vizdoom import *
from helper import create_agent
import numpy as np  # used by print_learning_step further down
import tensorflow as tf

In [5]:
# If running other experiments on GPUs
import os
os.environ["CUDA_VISIBLE_DEVICES"]=""

In [6]:
%load_ext autoreload
%autoreload 2

## Create agent

In [7]:
# Initializes DoomGame from config file
def initialize_vizdoom(config_file):
    game = DoomGame()
    game.load_config(config_file)
    game.init()
    return game  

In [8]:
# Initialize agent and TensorFlow graph
def make_new_agent():
    tf.reset_default_graph()
    agent_file_path = "./actor_critic/ac.json"
    config_file_path = "./actor_critic/ac.cfg"
    results_dir = "../tmp/tmp_results/"
    action_set = "basic_three"
    game = initialize_vizdoom(config_file_path)
    return create_agent(agent_file_path,
                        game=game, 
                        action_set=action_set,
                        output_directory=results_dir)

## N-step learning
First, we will walk through the n-step learning process.

In [167]:
# Prints status of memory buffers
def print_agent_status():
    print("s1_buffer:    \n", agent.s1_buffer[:, :5, 0, -1])
    print("a_buffer:     \n", agent.a_buffer)
    print("s2_buffer:    \n", agent.s2_buffer[:, :5, 0, -1])
    print("r_buffer:     \n", agent.r_buffer)
    print("gamma_buffer: \n", agent.gamma_buffer)
    print("memory r:     \n", agent.memory.r)

In [137]:
# View memory storage
agent = make_new_agent()
agent.initialize_new_episode()
for i in range(5):
    print("Step %d: " % (i+1))
    agent.perform_learning_step(1, 1)
    print_agent_status()
    print()
print("V(s): ", agent.network.get_value_output(agent.s2_buffer[-1]))

Step 1: 
s1_buffer:    
 [[ 0.47058824  0.46568626  0.47058824  0.47058824  0.41549021]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]
a_buffer:     
 [2 0 0 0 0]
s2_buffer:    
 [[ 0.47058824  0.46568626  0.47058824  0.47058824  0.41549021]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]
r_buffer:     
 [-0.04  0.    0.    0.    0.  ]
gamma_buffer: 
 [ 1.          0.99        0.9801      0.970299    0.96059601]
memory r:     
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Step 2: 
s1_buffer:    
 [[ 0.47058824  0.46568626  0.47058824  0.47058824  0.41549021]
 [ 0.47058824  0.46568626 

Initially, during the first $n$ steps, the agent only stores transitions consisting of $s1, a, s2, r$. After the $n$th step, the return $R$ can be calculated for time $t-n$. The first $Q(s,a) \approx \mathbb{E}[R_t]$ is given by:

$\sum_{i=0}^{n-1}(\gamma^i r_i)+\gamma^n V(s_n)$

which in this case is vectorized:

$
\begin{bmatrix} 
1.0
\\ 0.99
\\ 0.9801
\\ 0.970299
\\ 0.96059601
\end{bmatrix}
\cdot
\begin{bmatrix} 
-0.04
\\ -0.04
\\ -0.04
\\ -0.04
\\ -0.04
\end{bmatrix}
+ 0.99^5V(s_5) = -0.196 + (0.951)(0.228338) = 0.0211$

which matches the initial return placed in memory. The $s1, s2, a, r$ buffers roll to the beginning, placing the next transition at slot 0, while the $\gamma$ buffer rolls forward by one slot to match the new configuration. Now for the next five steps, we should see a transition added after each learning step, updating the $r$ array in memory as calculated above.
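The arithmetic above, and the reason the $\gamma$ buffer has to roll along with the other buffers, can be reproduced in a few lines of NumPy. This is a minimal sketch rather than the agent's actual code: the rewards, $\gamma$, and $V(s_5)$ are copied from the printout and the worked equation, while the variable names, the `oldest` slot index, and the second set of reward values are assumptions made purely for illustration.

```python
import numpy as np

n, gamma = 5, 0.99
discounts = gamma ** np.arange(n)      # gamma^0 ... gamma^(n-1)
r = np.full(n, -0.04)                  # rewards of the first five steps
v_s5 = 0.228338                        # V(s_5), taken from the worked equation

# n-step return: discounted rewards plus the bootstrapped value
R = discounts @ r + gamma ** n * v_s5
print(round(R, 4))

# Hypothetical circular layout: the oldest reward sits at slot `oldest`.
# Rolling the discount vector by the same amount keeps gamma^0 aligned with it,
# so a plain dot product still yields the correctly discounted sum.
rewards_in_time_order = np.array([0.1, 0.2, 0.3, 0.4, 0.5])  # made-up values
oldest = 2
r_buffer = np.roll(rewards_in_time_order, oldest)
gamma_buffer = np.roll(discounts, oldest)
aligned = np.isclose(r_buffer @ gamma_buffer, rewards_in_time_order @ discounts)
print(aligned)
```

With equal rewards, as in this episode, the roll makes no numerical difference, which is why the second half uses distinct reward values.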

In [138]:
# Now view adding transition to memory
for i in range(5):
    print("Step %d: " % (i+6))
    agent.perform_learning_step(1, 1)
    print_agent_status()
    print("V(s): ", agent.network.get_value_output(agent.s2_buffer[i]), "\n")

Step 6: 
s1_buffer:    
 [[ 0.4509804   0.37450981  0.41274509  0.62352943  0.54509807]
 [ 0.47058824  0.46568626  0.47058824  0.47058824  0.41549021]
 [ 0.47058824  0.46568626  0.47058824  0.47058824  0.41549021]
 [ 0.49990195  0.4509804   0.47058824  0.44549018  0.40000001]
 [ 0.49225491  0.43235293  0.41127452  0.49117646  0.59019607]]
a_buffer:     
 [1 2 2 1 1]
s2_buffer:    
 [[ 0.49225491  0.43235293  0.41127452  0.49117646  0.59019607]
 [ 0.47058824  0.46568626  0.47058824  0.47058824  0.41549021]
 [ 0.49990195  0.4509804   0.47058824  0.44549018  0.40000001]
 [ 0.49225491  0.43235293  0.41127452  0.49117646  0.59019607]
 [ 0.4509804   0.37450981  0.41274509  0.62352943  0.54509807]]
r_buffer:     
 [-0.04 -0.04 -0.04 -0.04 -0.04]
gamma_buffer: 
 [ 0.970299    0.96059601  1.          0.99        0.9801    ]
memory r:     
 [ 0.02110773  0.03489357  0.          0.          0.          0.          0.
  0.          0.          0.        ]
V(s):  [[ 0.24283469]] 

Step 7: 
s1_buffe

For the next four transitions, the returns are calculated as above. For example, in step 6, the return is:

$R_2 = -0.196 + (0.951)(0.242835) = 0.0349$

Note that the $\gamma$ buffer is displayed after it has been updated; thus, the buffer shown for the previous step represents the buffer used in the current calculation. Additionally, because the episode times out after 10 actions (40 tics), during step 10, the agent encounters a terminal state in s2; thus after calculating the last $n$ transitions, the buffers are reset, which is displayed in step 10. The returns for the last $n$ states are given as:

$R_{T-i}=\sum_{j=0}^{i} \gamma^j r_{T-i+j}$

for $i = 0, \dots, n-1$, or recursively:

$R_t \leftarrow r_t + \gamma R_{t+1}$

which in this case, because all rewards were $-0.04$, is simply:

$R_{T-i}=\sum_{j=0}^{i} \gamma^j (-0.04) = (-0.04)\sum_{j=0}^{i} \gamma^j$

Due to the recursive formula, the transitions are added in reverse order, as seen in the last 5 slots of the memory matrix $r$ above.
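The recursion can be checked against the last five slots of memory $r$. The following is a minimal sketch under the assumption that no value is bootstrapped after the terminal state; the variable names are illustrative and not taken from the agent.

```python
import numpy as np

gamma = 0.99
rewards = np.full(5, -0.04)   # r_{T-4}, ..., r_T, all -0.04 in this episode

returns = np.zeros_like(rewards)
running = 0.0                 # nothing is bootstrapped after a terminal state
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running   # R_t <- r_t + gamma * R_{t+1}
    returns[t] = running

# The recursion produces R_T first and R_{T-4} last, i.e. reverse time order
stored = returns[::-1]
print(stored)
```

The resulting values agree with the entries `-0.04, -0.0796, -0.118804, -0.15761596, -0.1960398` printed for the terminal step.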

## Storing transitions into memory
Now that we know the agent is properly calculating the n-step return of states, we need to make sure that the other variables ($s1, a, s2, isterminal$) are being properly associated with these returns in memory.

In [157]:
def print_memory_status():
    print("s1: ", agent.memory.s1[:, :5, 0, -1]) # due to storage of overlapping states
    print("a:  ", agent.memory.a)
    print("s2: ", agent.memory.s2[:, :5, 0, 0])
    print("R:  ", agent.memory.r)
    print("isterminal: ", agent.memory.isterminal)

In [168]:
# Load memory with transitions
agent = make_new_agent()
agent.initialize_new_episode()
print(agent.memory.s1.shape)
print(agent.memory.s2.shape)
for i in range(5):
    agent.perform_learning_step(1, 1)

# Compare current status with replay memory
print_agent_status()
print_memory_status()

(10, 30, 45, 4)
(10, 30, 45, 1)
s1_buffer:    
 [[ 0.46557733  0.48518518  0.50230938  0.47058824  0.42763618]
 [ 0.4559913   0.47058824  0.47058824  0.45254901  0.43845317]
 [ 0.40000001  0.40000001  0.40000001  0.40000001  0.40000001]
 [ 0.48007625  0.51218957  0.52156866  0.47058824  0.34901962]
 [ 0.39523965  0.41459695  0.40293029  0.36862746  0.39198259]]
a_buffer:     
 [0 1 2 1 2]
s2_buffer:    
 [[ 0.4559913   0.47058824  0.47058824  0.45254901  0.43845317]
 [ 0.40000001  0.40000001  0.40000001  0.40000001  0.40000001]
 [ 0.48007625  0.51218957  0.52156866  0.47058824  0.34901962]
 [ 0.39523965  0.41459695  0.40293029  0.36862746  0.39198259]
 [ 0.52156866  0.55385619  0.47503269  0.4509804   0.40000001]]
r_buffer:     
 [-0.04 -0.04 -0.04 -0.04 -0.04]
gamma_buffer: 
 [ 0.96059601  1.          0.99        0.9801      0.970299  ]
memory r:     
 [-0.17756607  0.          0.          0.          0.          0.          0.
  0.          0.          0.        ]
s1:  [[ 0.46557733 

Let's observe the next transitions $(t,...,T-2)$.

In [169]:
for i in range(4):
    agent.perform_learning_step(1, 1)

# Compare current status with replay memory
print_agent_status()
print_memory_status()

s1_buffer:    
 [[ 0.52156866  0.55385619  0.47503269  0.4509804   0.40000001]
 [ 0.45473856  0.51831156  0.4509804   0.4509804   0.34901962]
 [ 0.47058824  0.48962963  0.48007625  0.4516122   0.3685621 ]
 [ 0.5260784   0.52156866  0.40000001  0.52156866  0.54509807]
 [ 0.39523965  0.41459695  0.40293029  0.36862746  0.39198259]]
a_buffer:     
 [0 0 0 2 2]
s2_buffer:    
 [[ 0.45473856  0.51831156  0.4509804   0.4509804   0.34901962]
 [ 0.47058824  0.48962963  0.48007625  0.4516122   0.3685621 ]
 [ 0.5260784   0.52156866  0.40000001  0.52156866  0.54509807]
 [ 0.44149238  0.40000001  0.69411767  0.54509807  0.47058824]
 [ 0.52156866  0.55385619  0.47503269  0.4509804   0.40000001]]
r_buffer:     
 [-0.04 -0.04 -0.04 -0.04 -0.04]
gamma_buffer: 
 [ 1.          0.99        0.9801      0.970299    0.96059601]
memory r:     
 [-0.17756607 -0.16151434 -0.11614615 -0.15517615 -0.1869307   0.          0.
  0.          0.          0.        ]
s1:  [[ 0.46557733  0.48518518  0.50230938  0.47058

And finally to the terminal step:

In [170]:
# Perform terminal learning step
agent.perform_learning_step(1, 1)
print_agent_status()
print_memory_status()

s1_buffer:    
 [[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]
a_buffer:     
 [0 0 0 0 0]
s2_buffer:    
 [[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]
r_buffer:     
 [ 0.  0.  0.  0.  0.]
gamma_buffer: 
 [ 1.          0.99        0.9801      0.970299    0.96059601]
memory r:     
 [-0.17756607 -0.16151434 -0.11614615 -0.15517615 -0.1869307  -0.04       -0.0796
 -0.118804   -0.15761596 -0.1960398 ]
s1:  [[ 0.46557733  0.48518518  0.50230938  0.47058824  0.42763618]
 [ 0.4559913   0.47058824  0.47058824  0.45254901  0.43845317]
 [ 0.40000001  0.40000001  0.40000001  0.40000001  0.40000001]
 [ 0.48007625  0.51218957  0.52156866  0.47058824  0.34901962]
 [ 0.39523965  0.41459695  0.40293029  0.36862746  0.39198259]
 [ 0.44149238  0.40000001  0.69411767  0.54509807  0.47058824]
 [ 0.5260784   0.52156866  0.40000001  0.52156866  0.54509807]
 [ 0.470588

## Policy and value functions
Next, we will look into the policy and value functions, analyzing both their output and loss functions, as well as the gradients derived from them.

In [187]:
def print_learning_step():
    s1, a, s2, isterminal, R, w = agent.memory.get_sample(1)
    print("s1: ", s1[:, :3, 1, 1])
    print("a:  ", a)
    print("R:  ", R)
    print("w:  ", w)
    print("V:  ", agent.network.get_value_output(s1))
    print("pi: ", agent.network.get_policy_output(s1))
    loss_pi, loss_v = agent.network.learn(s1, a, R, weights=w)
    print("loss_pi: ", loss_pi, " loss_v: ", loss_v)

In [188]:
# Load memory with transitions
agent = make_new_agent()
agent.initialize_new_episode()
for i in range(10):
    agent.perform_learning_step(1, 1)
print_learning_step()

s1:  [[ 0.47058824  0.46568626  0.47058824]]
a:   [1]
R:   [-0.31398159]
w:   [ 1.]
V:   [[-0.11301116]]
pi:  [[ 0.23030756  0.27955541  0.49013704]]
loss_pi:  [[-0.26658764]]  loss_v:  0.0403891


Let's go through the loss calculations. The loss of the policy is given by:

$L_{\pi} = -\log(\pi(a_t|s_t))(R_t-V(s_t)) - \beta H(\pi(s_t))$

where the entropy is $H(\pi(s_t)) = \sum_{i}-\pi(a_i|s_t)\log(\pi(a_i|s_t))$ and $\beta = 0.01$. For this transition, this becomes:

$L_{\pi} = -\log(0.2796)(-0.3140-(-0.1130))+(0.01)((0.2303)\log(0.2303) + (0.2796)\log(0.2796) + (0.4901)\log(0.4901)) = -0.2666$

The loss of the value function is simpler, as it is just the (mean) squared error over the sampled transitions:

$L_{V} = \sum_{t} (R_t - V(s_t))^2$

which in this case becomes:

$L_{V} = (-0.3140 - (-0.1130))^2 = 0.0404$
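Both losses can be recomputed from the printed values as a sanity check. This is a sketch only: `pi`, `a`, `R`, and `V` are copied from the output above, $\beta = 0.01$ is taken from the worked equation, and the natural logarithm is assumed.

```python
import numpy as np

pi = np.array([0.23030756, 0.27955541, 0.49013704])  # printed policy output
a, R, V = 1, -0.31398159, -0.11301116                # printed action, return, value
beta = 0.01                                          # entropy weight from the worked equation

advantage = R - V
entropy = -np.sum(pi * np.log(pi))

# Policy loss: negative log-likelihood scaled by the advantage, minus entropy bonus
loss_pi = -np.log(pi[a]) * advantage - beta * entropy
# Value loss: squared error for the single sampled transition
loss_v = advantage ** 2
print(loss_pi, loss_v)
```

Up to rounding, these agree with the `loss_pi` and `loss_v` reported by `agent.network.learn`.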

Let's redefine the previous function to now include gradients.

In [9]:
def print_learning_step():
    s1, a, s2, isterminal, R, w = agent.memory.get_sample(1)
    s1 = agent.network._check_state(s1)
    a = agent.network._check_actions(a)
    if w is None:
        w = np.ones(a.shape[0])
    print("s1: ", s1[:, :3, 1, 1])
    print("a:  ", a)
    print("R:  ", R)
    print("w:  ", w)
    print("V:  ", agent.network.get_value_output(s1))
    print("pi: ", agent.network.get_policy_output(s1))
    opt = agent.network.optimizer
    var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 
                                 scope=agent.network.scope)
    fd = {agent.network.state: s1, agent.network.actions: a, 
          agent.network.q_sa: R, agent.network.IS_weights: w}
    sess = agent.network.sess
    gvs_pi = opt.compute_gradients(agent.network.loss_pi, var_list=var_list)
    grads_pi = sess.run([g for g, v in gvs_pi], feed_dict=fd)
    gvs_v = opt.compute_gradients(agent.network.loss_v, var_list=var_list)
    gvs_v = [[g, v] for g, v in gvs_v if g is not None]
    grads_v = sess.run([g for g, v in gvs_v], feed_dict=fd)
    loss_pi, loss_v = agent.network.learn(s1, a, R, weights=w)
    print("loss_pi: ", loss_pi, " loss_v: ", loss_v)
    print("d(loss_pi)/d(pi): %s" % gvs_pi[-1][1].name[:-2], grads_pi[-1])
    print("d(loss_pi)/d(V): %s" % gvs_pi[-3][1].name[:-2], grads_pi[-3])
    print("d(loss_V)/d(V): %s" % gvs_v[-1][1].name[:-2], grads_v[-1])

In [16]:
# Load memory with transitions
agent = make_new_agent()
agent.initialize_new_episode()
for i in range(10):
    agent.perform_learning_step(1, 1)
print_learning_step()

s1:  [[ 0.4509804   0.37450981  0.41274509]]
a:   [[0 2]]
R:   [-0.04]
w:   [ 1.]
V:   [[ 0.20522182]]
pi:  [[ 0.29009074  0.33439586  0.3755134 ]]
loss_pi:  [[-0.25111637]]  loss_v:  0.0601337
d(loss_pi)/d(pi): global_network/pi/biases [-0.07155561 -0.08200891  0.15356451]
d(loss_pi)/d(V): global_network/V/biases [-0.97946107]
d(loss_V)/d(V): global_network/V/biases [ 0.49044365]


The derivatives of the loss functions are as follows:

$\dfrac{\partial L_\pi}{\partial \pi(a|s_t)} = 
\left\{\begin{matrix}
   -\dfrac{R_t - V(s_t)}{\pi(a|s_t)} + \beta (1 + \log\pi(a|s_t)) \text{ if } a = a_t \\ 
   \beta (1 + \log\pi(a|s_t)) \text{ if } a \neq a_t
\end{matrix}\right.$

$\dfrac{\partial L_\pi}{\partial V(s_t)} = \log\pi(a_t|s_t)$

$\dfrac{\partial L_V}{\partial \pi(a|s_t)} = 0$

$\dfrac{\partial L_V}{\partial V(s_t)} = -2(R_t - V(s_t))$
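The two derivatives with respect to $V(s_t)$ can be compared directly with the printed gradients of the `V/biases` variable. This sketch assumes that the value head is a linear output (so $\partial V / \partial b_V = 1$) and that the second entry of the printed `a = [[0 2]]` is the action index; neither is confirmed by the output itself.

```python
import numpy as np

pi = np.array([0.29009074, 0.33439586, 0.3755134])  # printed policy output
a_t = 2                      # assumed action index, read from a = [[0 2]]
R, V = -0.04, 0.20522182     # printed return and value

dLpi_dV = np.log(pi[a_t])    # dL_pi/dV = log pi(a_t|s_t)
dLV_dV = -2.0 * (R - V)      # dL_V/dV  = -2 (R_t - V(s_t))
print(dLpi_dV, dLV_dV)
```

Under those assumptions the results agree with the printed `-0.97946107` and `0.49044365`.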

Additionally, due to the softmax activation in the $\pi$ output layer, the derivatives for its biases and weights must backpropagate through the softmax function. Further, because of the summation term ($\sum_i e^{f_i}$) in softmax, the derivatives $\frac{\partial \pi(a_i|s_t)}{\partial w_{\pi(a_j|s_t)}}$, where $i \neq j$, are nonzero. Therefore:

$\dfrac{\partial L_\pi}{\partial w_{\pi(a_j|s_t)}}
= \dfrac{\partial L_\pi}{\partial \pi(s_t)} \dfrac{\partial \pi(s_t)}{\partial w_{\pi(a_j|s_t)}}
= \sum_i \left ( \dfrac{\partial L_\pi}{\partial \pi(a_i|s_t)} \dfrac{\partial \pi(a_i|s_t)}{\partial w_{\pi(a_j|s_t)}} \right )$,

where, for $f_k = b_{\pi(a_k|s_t)} + w_{\pi(a_k|s_t)}^{(1)}x^{(1)} + w_{\pi(a_k|s_t)}^{(2)}x^{(2)} + \cdots + w_{\pi(a_k|s_t)}^{(n)}x^{(n)}$:

$ \dfrac{\partial \pi(a_i|s_t)}{\partial b_{\pi(a_j|s_t)}}
= \dfrac{\partial}{\partial b_{\pi(a_j|s_t)}} \left ( \dfrac{e^{f_i}}{\sum_k e^{f_k}} \right )
= \dfrac{e^{f_i} \dfrac{\partial f_i}{\partial b_{\pi(a_j|s_t)}} \left ( \sum_k e^{f_k} \right ) - e^{f_i} \dfrac{\partial}{\partial b_{\pi(a_j|s_t)}} \left ( \sum_k e^{f_k} \right )}{\left ( \sum_k e^{f_k} \right )^2} = 
\left\{\begin{matrix}
    \dfrac{e^{f_i} \sum_k e^{f_k} - e^{2f_i}}{\left ( \sum_k e^{f_k} \right )^2} 
    = \dfrac{e^{f_i}}{\sum_k e^{f_k}} - \left ( \dfrac{e^{f_i}}{\sum_k e^{f_k}} \right )^2
    = \pi(a_i|s_t) - (\pi(a_i|s_t))^2
    = \pi(a_i|s_t)(1 - \pi(a_i|s_t)) \text{ if } i = j \\
    \dfrac{(0) \sum_k e^{f_k} - e^{f_i}e^{f_j}}{\left ( \sum_k e^{f_k} \right )^2}
    = -\left ( \dfrac{e^{f_i}}{\sum_k e^{f_k}} \right ) \left ( \dfrac{e^{f_j}}{\sum_k e^{f_k}} \right )
    = -\pi(a_i|s_t)\pi(a_j|s_t) \text{ if } i \neq j
\end{matrix}\right.$

Similarly, the expressions for $\frac{\partial \pi(a_i|s_t)}{\partial w_{\pi(a_j|s_t)}^{(m)}}$ are the same as above except multiplied by $\frac{\partial f_j}{\partial w_{\pi(a_j|s_t)}^{(m)}} = x^{(m)}$:

$\dfrac{\partial \pi(a_i|s_t)}{\partial w_{\pi(a_j|s_t)}^{(m)}} =
\left\{\begin{matrix}
    x^{(m)}\pi(a_i|s_t)(1 - \pi(a_i|s_t)) \text{ if } i = j \\
    -x^{(m)}\pi(a_i|s_t)\pi(a_j|s_t) \text{ if } i \neq j
\end{matrix}\right.$
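The chain rule through the softmax can be verified numerically against the `pi/biases` gradient printed by the last run. This is a hedged sketch: `pi`, `R`, and `V` are copied from that output, the action index `a_t = 2` is read from the printed `a = [[0 2]]`, and $\beta = 0.01$ is assumed to be the same entropy weight as in the earlier worked equation.

```python
import numpy as np

pi = np.array([0.29009074, 0.33439586, 0.3755134])  # printed policy output
a_t, R, V = 2, -0.04, 0.20522182                    # a_t is an assumed reading of a
beta = 0.01                                         # assumed entropy weight

# dL_pi/dpi(a|s_t) for every action, following the piecewise derivative
dL_dpi = beta * (1.0 + np.log(pi))
dL_dpi[a_t] -= (R - V) / pi[a_t]

# Softmax Jacobian: J[i, j] = dpi_i/db_j = pi_i * (delta_ij - pi_j)
jacobian = np.diag(pi) - np.outer(pi, pi)

# Chain rule: dL_pi/db_j = sum_i dL_pi/dpi_i * dpi_i/db_j
grad_b = dL_dpi @ jacobian
print(grad_b)
```

Under these assumptions `grad_b` agrees with the printed `[-0.07155561 -0.08200891  0.15356451]` to about five decimal places, which supports the derivation above.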

We can plug in the numbers above to check these values:

$\frac{\partial L_\pi}{\partial \pi(a_0|s_t)} = -\frac{-0.6363 - (-0.3326)}{0.4321} + (0.01)(1 + \log(0.4321)) \approx 0.7045$

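Note that the bias is inside the softmax function, so the printed bias gradients cannot be compared with $\frac{\partial L_\pi}{\partial \pi(a|s_t)}$ directly; they must first be propagated through the softmax as derived above.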

$L_{Q} = \sum_{i} (r_i + \gamma Q'(s_t,a_t) - Q(s_t,a_t))^2$

In [13]:
test_session = tf.Session()
x = tf.constant([3, 2, 5])
y = tf.constant([2])
mul = x * y
test_session.run(mul)

array([ 6,  4, 10], dtype=int32)