<img src="images/tf_a3c.png" align=right width=40%></img>
# TensorFlow Implementation of A3C
Author: Jin Yeom (jinyeom@utexas.edu)

## Contents
- [Actor-Critic](#Actor-Critic)
- [Worker](#Worker)
- [Coffee break](#Coffee-break)
- [References](#References)

In [1]:
import numpy as np
import tensorflow as tf

In this notebook, we implement the **Asynchronous Advantage Actor-Critic (A3C)** algorithm with **[TensorFlow](https://www.tensorflow.org/)**. The project has two goals: (1) to learn how to work with TensorFlow in more depth, and (2) to have my own working implementation of the A3C algorithm for other current or future reinforcement learning projects. To keep this notebook simple, we will aim to build an agent that plays the Atari 2600 game **Pong**.

Note that **A2C** (a synchronous variant of A3C) has been shown to be more cost-effective when trained on a single GPU, or TPU, for that matter. We will first try our A3C implementation on a *multi-core CPU*, then A2C on a *TPU (Tensor Processing Unit)*, and compare their cost and performance.

## Actor-Critic

In [2]:
class ActorCriticLSTMPolicy(object):
    """Convolutional LSTM network with an actor (policy) head and a critic (value) head."""

    def __init__(self, num_actions, name):
        with tf.variable_scope(name):
            # 84x84 single-channel observations.
            self.observation = tf.placeholder(tf.float32, shape=[None, 84, 84, 1], name='observation')

            # Convolutional feature extractor followed by a dense layer.
            z1 = tf.layers.conv2d(self.observation, 16, 8, strides=(4, 4), activation=tf.nn.relu)
            z2 = tf.layers.conv2d(z1, 32, 4, strides=(2, 2), activation=tf.nn.relu)
            z3 = tf.layers.dense(tf.layers.flatten(z2), 256, activation=tf.nn.relu)
            z3 = tf.expand_dims(z3, axis=1) # [batch_size, max_time=1, 256]

            # NOTE: the initial state is built for a batch size of 1 and is all zeros
            # on every call, so feed one observation at a time.
            lstm_cell = tf.nn.rnn_cell.LSTMCell(256)
            initial_state = lstm_cell.zero_state(1, dtype=tf.float32)
            z4, self.hidden = tf.nn.dynamic_rnn(lstm_cell, z3, initial_state=initial_state)
            z4 = tf.reshape(z4, [-1, 256])

            # Actor head: action probabilities. Critic head: scalar state value.
            self.actor = tf.layers.dense(z4, num_actions, activation=tf.nn.softmax)
            self.critic = tf.layers.dense(z4, 1)

    def __call__(self, sess, observation):
        """Return the action probabilities and value estimate for an observation."""
        feed_dict = {self.observation: observation}
        action_probs, value = sess.run([self.actor, self.critic], feed_dict=feed_dict)
        return action_probs, value
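
To see what the dense layer receives, the feature-map sizes can be worked out by hand. `tf.layers.conv2d` uses `'valid'` padding unless told otherwise, so each layer shrinks the spatial size to `(size - kernel) // stride + 1`. The sketch below applies that formula to the two convolutional layers above; `conv_output_size` is a hypothetical helper introduced only for this illustration and is not part of the notebook.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution with 'valid' (no) padding.

    Hypothetical helper for illustration; not part of the notebook's code.
    """
    return (size - kernel) // stride + 1

h1 = conv_output_size(84, kernel=8, stride=4)  # first conv layer: 16 filters, 8x8, stride 4
h2 = conv_output_size(h1, kernel=4, stride=2)  # second conv layer: 32 filters, 4x4, stride 2
flat = h2 * h2 * 32                            # flattened size fed to the dense layer

print(h1, h2, flat)  # 20 9 2592
```

So the 256-unit dense layer takes a 2592-dimensional input per observation.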

Let's quickly test it before we move on!

In [3]:
ac_lstm_policy = ActorCriticLSTMPolicy(4, name='test')
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        observation = np.random.normal(loc=0.0, scale=0.1, size=(1, 84, 84, 1))
        p, v = ac_lstm_policy(sess, observation)
        print(f"p = {p}, v = {v}")

p = [[0.2500289  0.24956998 0.24932864 0.25107247]], v = [[0.0030722]]
p = [[0.25029135 0.24919741 0.24967816 0.25083306]], v = [[0.00383127]]
p = [[0.2507164  0.24996355 0.24951242 0.24980763]], v = [[-0.0002115]]
p = [[0.25096047 0.24900228 0.25008363 0.24995358]], v = [[0.00470447]]
p = [[0.25009653 0.24965948 0.24969296 0.25055102]], v = [[-0.00051197]]
p = [[0.25060332 0.24939239 0.2492773  0.25072697]], v = [[0.00434569]]
p = [[0.25048476 0.24971694 0.24918285 0.2506155 ]], v = [[0.00751681]]
p = [[0.2502596  0.25017738 0.24922073 0.25034225]], v = [[0.00192816]]
p = [[0.24972083 0.25049695 0.24911007 0.25067207]], v = [[0.00181042]]
p = [[0.25010926 0.2501092  0.24939291 0.25038865]], v = [[0.00428144]]
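
The probabilities are close to uniform because the network is untrained. To turn such an output into an action, one common choice is to sample from the distribution rather than take the argmax, which keeps the agent exploring. Below is a minimal NumPy sketch of that step; `sample_action` is a hypothetical helper, not something defined elsewhere in this notebook, and the renormalization is an assumption added to guard against float32 rounding.

```python
import numpy as np

def sample_action(action_probs, rng):
    """Sample one action index from a [1, num_actions] probability array.

    Hypothetical helper for illustration; not part of the notebook's code.
    """
    probs = np.asarray(action_probs, dtype=np.float64)[0]
    probs = probs / probs.sum()  # renormalize so the probabilities sum to 1 exactly
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
# First row of the output printed above.
p = np.array([[0.2500289, 0.24956998, 0.24932864, 0.25107247]], dtype=np.float32)
action = sample_action(p, rng)
print(action)
```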


## Worker

This time, we're going to build actor networks that will actually interact with their copies of the environment and update their parameters.

In [None]:
class ActorPolicy(ActorCriticLSTMPolicy):
    def __init__(self, num_actions, name):
        super(ActorPolicy, self).__init__(num_actions, name)
        with tf.variable_scope(name):
            # TODO: add members needed for updating
            pass  # placeholder so the empty block is valid Python
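
The update itself is still a TODO here. One ingredient it will likely need, following the usual A3C formulation, is the discounted return of each step in a rollout and the advantage obtained by subtracting the critic's value estimate. The sketch below shows that computation in plain NumPy under stated assumptions: `discounted_returns`, `bootstrap_value`, and `gamma` are hypothetical names chosen for illustration, and the reward and value numbers are made up.

```python
import numpy as np

def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted return for each step of a rollout, computed backwards.

    bootstrap_value is the critic's estimate for the state after the last
    step (0.0 if the episode ended). Hypothetical helper for illustration.
    """
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Made-up rollout: a single reward at the final step, episode ends afterwards.
rewards = [0.0, 0.0, 1.0]
values = np.array([0.1, 0.2, 0.3], dtype=np.float32)  # made-up critic outputs

returns = discounted_returns(rewards, bootstrap_value=0.0, gamma=0.9)
advantages = returns - values  # how much better each step went than the critic predicted

print(returns)
print(advantages)
```

With these inputs the returns are 0.81, 0.9, and 1.0 from the first step to the last.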

## Coffee break

In [None]:
import multiprocessing

# Scratch cell: Agent, Worker, DoomGame, a_size, s_size, model_path, and
# global_episodes are not defined in this notebook yet.
with tf.device("/cpu:0"):
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
    master_network = Agent(a_size, name='global')
    num_workers = multiprocessing.cpu_count()
    workers = []
    # Create one worker per CPU core
    for i in range(num_workers):
        workers.append(Worker(DoomGame(), i, s_size, a_size, optimizer, model_path, global_episodes))
    saver = tf.train.Saver(max_to_keep=5)

## References