<img src="images/tf_a3c.png" align=right width=40%></img>
# TensorFlow Implementation of A3C
Author: Jin Yeom (jinyeom@utexas.edu)

## Contents
- [Actor-Critic](#Actor-Critic)
- [Worker](#Worker)
- [Coffee break](#Coffee-break)
- [References](#References)

In [1]:
import tensorflow as tf

In this notebook, we implement the **Asynchronous Advantage Actor-Critic (A3C)** algorithm with **[TensorFlow](https://www.tensorflow.org/)**. The project has two goals: (1) to learn how to work with TensorFlow in more depth, and (2) to have my own working implementation of the A3C algorithm for current and future reinforcement learning projects. To keep this notebook simple, we will aim to build an agent that plays the Atari 2600 game **Pong**.

Note that **A2C** (a synchronous variant of A3C) has been shown to be more cost-effective when trained on a single GPU, or TPU, for that matter. We will first try our A3C implementation on a *multi-core CPU*, then A2C on a *TPU (Tensor Processing Unit)*, to compare their cost and performance.

## Actor-Critic

In [2]:
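def actor_critic_lstm(observation, num_actions):
    """Build the actor-critic network; returns (action probabilities, state value)."""
    # Convolutional encoder over image observations (tf.layers default 'valid' padding).
    z1 = tf.layers.conv2d(observation, 16, 8, strides=(4, 4), activation=tf.nn.relu)
    z2 = tf.layers.conv2d(z1, 32, 4, strides=(2, 2), activation=tf.nn.relu)
    z3 = tf.layers.dense(tf.layers.flatten(z2), 256, activation=tf.nn.relu)
    z3 = tf.expand_dims(z3, axis=1)  # [batch_size, max_time=1, 256]

    lstm_cell = tf.nn.rnn_cell.LSTMCell(256)
    # NOTE: the initial state is built for a batch size of 1,
    # so only one observation can be fed at a time.
    initial_state = lstm_cell.zero_state(1, dtype=tf.float32)
    z4, h = tf.nn.dynamic_rnn(lstm_cell, z3, initial_state=initial_state)
    z4 = tf.reshape(z4, [-1, 256])

    # Actor head: softmax policy over actions. Critic head: scalar state value.
    actor = tf.layers.dense(z4, num_actions, activation=tf.nn.softmax)
    critic = tf.layers.dense(z4, 1)
    return actor, critic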
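
The layer sizes in `actor_critic_lstm` can be traced by hand. The sketch below assumes the default `'valid'` padding of `tf.layers.conv2d` and an 84x84 single-channel input; `conv_output_size` is a hypothetical helper written for illustration, not part of TensorFlow.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a 'valid' (no padding) convolution."""
    return (size - kernel) // stride + 1


size = 84  # assumed input height and width
size = conv_output_size(size, kernel=8, stride=4)  # first conv layer, 16 filters
after_conv1 = (size, size, 16)
size = conv_output_size(size, kernel=4, stride=2)  # second conv layer, 32 filters
after_conv2 = (size, size, 32)

# Number of features fed into the 256-unit dense layer after flattening.
flattened = after_conv2[0] * after_conv2[1] * after_conv2[2]
print(after_conv1, after_conv2, flattened)
```

Under these assumptions the feature maps are 20x20x16 and 9x9x32, so the dense layer receives 2592 features per observation.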

Let's quickly test the network before we move on!

In [3]:
actor, critic = actor_critic_lstm(tf.random_normal([1, 84, 84, 1]), 4)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        p, v = sess.run([actor, critic])
        print(f"p = {p}, v = {v}")

p = [[0.2549999  0.24273217 0.25047782 0.25179014]], v = [[0.01526831]]
p = [[0.2595103  0.24091478 0.24680299 0.25277194]], v = [[-0.0019261]]
p = [[0.2620512  0.2418183  0.24685757 0.24927294]], v = [[-0.00755529]]
p = [[0.2520923  0.24061431 0.25139946 0.25589398]], v = [[0.00600095]]
p = [[0.25361642 0.24089344 0.24699068 0.25849947]], v = [[0.00095484]]
p = [[0.25879407 0.23946841 0.241341   0.26039648]], v = [[0.03741891]]
p = [[0.25741002 0.24537227 0.24542364 0.25179407]], v = [[0.00833888]]
p = [[0.26078522 0.24249677 0.24611577 0.25060225]], v = [[0.02895084]]
p = [[0.25691882 0.24525094 0.24387167 0.25395858]], v = [[0.0146306]]
p = [[0.261409   0.24053068 0.23811433 0.25994602]], v = [[0.02571743]]
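
Each `p` printed above is a probability distribution over the four actions, and an untrained network gives a nearly uniform one. As a minimal sketch of how an agent might pick an action from such an output, the snippet below samples an index in proportion to those probabilities using only the standard library; `sample_action` is a hypothetical helper, not part of the notebook.

```python
import random


def sample_action(probs, rng=random):
    """Draw an action index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# First probability row printed above.
p = [0.2549999, 0.24273217, 0.25047782, 0.25179014]

rng = random.Random(0)  # seeded generator so the run is repeatable
counts = [0] * len(p)
for _ in range(10000):
    counts[sample_action(p, rng)] += 1
print(counts)
```

Sampling, rather than always taking the most likely action, keeps the agent exploring while the policy is still close to uniform.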


In [None]:
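class Agent(object):
    """Wraps the actor-critic network and its input placeholder under one variable scope."""
    def __init__(self, num_actions, name):
        with tf.variable_scope(name):
            # Grayscale 84x84 frames; the batch dimension is left unspecified.
            self.observation = tf.placeholder(tf.float32, shape=[None, 84, 84, 1], name='observation')
            self.action_prob, self.value = actor_critic_lstm(self.observation, num_actions)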
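
The `Agent` above only defines the forward pass. As a rough sketch of how `action_prob` and `value` are usually turned into a training signal in A3C, the snippet below computes discounted returns and the three standard loss terms (policy, value, entropy) for a single step in plain Python. The function names and the coefficients `value_coef=0.5` and `entropy_coef=0.01` are assumptions chosen for illustration, not values taken from this notebook.

```python
import math


def discounted_returns(rewards, bootstrap_value, gamma):
    """R_t = r_t + gamma * R_{t+1}, seeded with the critic's estimate of the last state."""
    returns, running = [], bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def a3c_loss_terms(action_prob, action, value, target_return,
                   value_coef=0.5, entropy_coef=0.01):
    """Per-step loss terms; the coefficients are assumed defaults."""
    # Advantage: how much better the return was than the critic predicted.
    # In a TensorFlow graph this would be treated as a constant for the actor.
    advantage = target_return - value
    policy_loss = -math.log(action_prob[action]) * advantage
    value_loss = (target_return - value) ** 2
    # The entropy bonus discourages the policy from collapsing too early.
    entropy = -sum(p * math.log(p) for p in action_prob if p > 0)
    total = policy_loss + value_coef * value_loss - entropy_coef * entropy
    return policy_loss, value_loss, entropy, total


returns = discounted_returns([0.0, 0.0, 1.0], bootstrap_value=0.0, gamma=0.9)
policy_loss, value_loss, entropy, total = a3c_loss_terms(
    action_prob=[0.25, 0.25, 0.25, 0.25], action=0, value=0.0,
    target_return=returns[0])
print(returns, total)
```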

## Coffee break

In [None]:
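import multiprocessing

# NOTE: `a_size`, `s_size`, `model_path`, `global_episodes`, `Worker`, and `DoomGame`
# are not defined in this notebook yet.
with tf.device("/cpu:0"):
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
    master_network = Agent(a_size, name='global')  # shared global network
    num_workers = multiprocessing.cpu_count()  # one worker per available CPU core
    workers = []
    # Create the workers; they all share the same optimizer
    for i in range(num_workers):
        workers.append(Worker(DoomGame(), i, s_size, a_size, optimizer, model_path, global_episodes))
    saver = tf.train.Saver(max_to_keep=5)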
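
The cell above creates one worker per CPU core around a shared global network. To illustrate the asynchronous update pattern in isolation, here is a toy sketch using only the standard library: several threads each read the shared parameter, compute a gradient locally, and apply it to the global copy. `GlobalParams` and `worker_loop` are hypothetical stand-ins, the "network" is a single scalar, and a lock is used for simplicity even though A3C itself is usually described with lock-free updates.

```python
import threading


class GlobalParams:
    """Stand-in for the shared global network: a single scalar parameter."""

    def __init__(self):
        self.theta = 0.0
        self.updates = 0
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self.theta

    def apply_gradient(self, grad, learning_rate):
        with self._lock:
            self.theta -= learning_rate * grad
            self.updates += 1


def worker_loop(shared, target, num_steps, learning_rate):
    for _ in range(num_steps):
        local_theta = shared.read()          # sync the local copy from the global one
        grad = 2.0 * (local_theta - target)  # gradient of (theta - target) ** 2
        shared.apply_gradient(grad, learning_rate)  # push the update back


shared = GlobalParams()
num_workers = 4  # assumed worker count for this toy example
threads = [
    threading.Thread(target=worker_loop, args=(shared, 3.0, 200, 0.01))
    for _ in range(num_workers)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared.updates, shared.theta)
```

Each worker may apply a gradient computed from a slightly stale copy of the parameter, which is the trade-off asynchronous training accepts in exchange for parallelism.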

## References