# Actor Critic Methods

Actor Critic Methods combine both Value Based and Policy Gradient Methods.

##### Value Based Methods:
- Maps each state action pair to a value.
- Take action with highest value when exploiting.
- Works for finite set of actions.
- eg Q Learning, DQN

$
\begin{align}
Q(s, a) = Q(s, a) + \alpha [r + \gamma max_{a^{'}}Q(s^{'}, a^{'}) - Q(s, a)]
\end{align}
$



##### Policy Gradient Methods:
- Directly optimize the policy given a state.
- Maps state to action distribution.
- Useful when large number of possible actions.
- eg Policy Gradients


$
\begin{align}
\bigtriangledown \text{J}(\theta) = E_{\pi}[ \bigtriangledown \log(\pi(\tau)) r(\tau) ]
\end{align}
$


Issues with Policy Gradients
- High variance due to different trajectories being taken from the same step in different episodes.
- No Learning takes place when cumulative reward is 0
- Monte Carlo Method so have to wait until end of trajectory to process.

Instead of using the discounted future reward of the trajectory we can use an estimate of the Q Value.
This can be done by running two networks in parallel. One to learn the Q value and the second that uses policy gradients to learn the action probability distribution.

$
\begin{align}
\bigtriangledown \text{J}(\theta) = E_{\pi}[ \bigtriangledown \log(\pi(\tau)) Q(s_{t}, a_{t}) ]
\end{align}
$

Since we can plug in the Q Value we reduce the variance as well as allow us to use a temporal difference method instead of Monte Carlo.

Using Advantage
$
\begin{align}
A(s_{t}, a_{t}) = Q(s_{t}, a_{t}) - V(s_{t})
\end{align}
$

$
\begin{align}
A(s_{t}, a_{t}) = r_{t+1} + \gamma V(s_{t+1}) - V(s_{t})
\end{align}
$


$
\begin{align}
\bigtriangledown \text{J}(\theta) = E_{\pi}[ \bigtriangledown \log(\pi(\tau)) A(s_{t}, a_{t}) ]
\end{align}
$

Here we take the Q value for a state minus the average Value for the state. We can think of it as how much better or worse the current action is compared with the rest of the actions.


In [1]:
import gym
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import sys

### Building the Network

In [2]:
class ActorCritic:
    def __init__(self, sess):
        self.sess = sess
        with tf.variable_scope("a2c", reuse=tf.AUTO_REUSE):
            self.inputs = tf.placeholder(tf.float32, [None, 4], name="inputs")
            
            self.fc1 = tf.layers.dense(inputs = self.inputs,
                                      units = 64,
                                      activation = tf.nn.relu,
                                      kernel_initializer=tf.contrib.layers.xavier_initializer())
            self.fc2 = tf.layers.dense(inputs = self.fc1,
                                      units = 128,
                                      activation = tf.nn.relu,
                                      kernel_initializer=tf.contrib.layers.xavier_initializer())
            # Instead of having two different networks we can also share the first few layers for both the Actor and Critic.
            # The layer fc3 will be used as the input to the final layers for the actor and critic.
            self.fc3 = tf.layers.dense(inputs = self.fc2,
                                      units = 64,
                                      activation = tf.nn.relu,
                                      kernel_initializer=tf.contrib.layers.xavier_initializer())
            # Actor
            # Return the Probability Distribution for each action.
            self.action_prob = tf.layers.dense(inputs = self.fc3,
                                              units = 2,
                                              activation = tf.nn.softmax,
                                              kernel_initializer=tf.contrib.layers.xavier_initializer())
            
            #Critic
            # Single node to output the Value Function
            self.state_value = tf.layers.dense(inputs = self.fc3,
                                              units = 1,
                                              activation = None,
                                              kernel_initializer=tf.contrib.layers.xavier_initializer())
            
            self.rewards = tf.placeholder(tf.float32, [None, 1], name="rewards")
            self.actions = tf.placeholder(tf.float32, [None, 2], name="actions")
            self.state_values_est = tf.placeholder(tf.float32, [None, 1], name="value_estimates")
            
            #  Calculate the Log of the probabilities.
            self.action_log_probs = tf.log(self.action_prob)
            
            
            self.chosen_action_log_probs = tf.multiply(self.actions, self.action_log_probs)
            
            # Calculate TD Error
            self.advantages = self.rewards - self.state_values_est
            self.value_loss = tf.reduce_mean(tf.pow(self.advantages, 2))
            
            # Add entropy to increase exploration,
            self.entropy = tf.reduce_mean(tf.multiply(self.action_prob, self.action_log_probs))
            self.action_gain = tf.reduce_mean(tf.multiply(self.chosen_action_log_probs, self.advantages))
            
            self.total_loss = self.value_loss - self.action_gain - 0.0001*self.entropy
            self.optimizer = tf.train.AdamOptimizer(0.0001).minimize(self.total_loss)
            
    # Only Actor
    def get_action(self, sess, state):
        action_prob = sess.run([self.action_prob], {self.inputs:state})
        action_prob = action_prob[0]
        action = np.random.choice(action_prob.shape[1], action_prob.shape[0], p=action_prob[0].ravel())
        action = np.asscalar(action)
        return action
    
    # Only Critic
    def get_state_value(self, sess, state):
        value = sess.run([self.state_value], {self.inputs:state})
        return value
    
    # Both Actor and Critic
    def evaluate_actions(self, state):
        action_prob, value = sess.run([self.action_prob, self.state_value], {self.inputs:state})
        return action_prob, state_values

In [3]:
GAMMA = .95
LEARNING_RATE = 0.01
N_GAMES = 2000
N_STEPS = 20
env = gym.make('CartPole-v0')

In [4]:
def get_discounted_rewards(model, rewards, dones):
    
    R = []
    rewards.reverse()

    # If we happen to end the set on a terminal state, set next return to zero
    if dones[-1] == True: 
        next_return = 0
        
    # If not terminal state, bootstrap v(s) using our critic
    # TODO: don't need to estimate again, just take from last value of v(s) estimates
    else: 
        value = model.get_state_value(model.sess, states[-1])
        next_return = value[0][0]
    
    # Backup from last state to calculate "true" returns for each state in the set
    R.append(next_return)
    dones.reverse()
    for r in range(1, len(rewards)):
        if not dones[r]: 
            this_return = rewards[r] + next_return * GAMMA
        else:
            this_return = 0
        R.append(this_return)
        next_return = this_return

    R.reverse()
    return R

In [5]:
def reflect(model, states, actions, rewards, dones):
    states = np.reshape(states, [-1, 4])
    discounted_rewards = get_discounted_rewards(model, rewards, dones)
    discounted_rewards = np.reshape(discounted_rewards, [-1, 1])
    actions = np.reshape(actions, [-1, 1])
    actions = np.eye(2)[actions]
    actions = np.reshape(actions, [-1, 2])
    state_value_estimates = model.get_state_value(sess, states)
    state_value_estimates = np.reshape(state_value_estimates, [-1, 1])
    
    sess.run([model.total_loss, model.optimizer], {
        model.inputs: states,
        model.rewards: discounted_rewards,
        model.actions: actions,
        model.state_values_est: state_value_estimates
    })

### Train Agent

In [6]:
state = env.reset()
finished_games = 0

sess = tf.Session()
model = ActorCritic(sess)
sess.run(tf.global_variables_initializer())
total_reward = 0
state_size = env.observation_space.shape[0]
while finished_games < N_GAMES:
    states, actions, rewards, dones = [], [], [], []
    # Gather training data
    for i in range(N_STEPS):
        state = np.reshape(state, [-1, state_size])
        action = model.get_action(sess, state)

        next_state, reward, done, _ = env.step(action)

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        dones.append(done)
        total_reward += reward

        if done: 
            state = env.reset()
            finished_games += 1
            if finished_games % 50 == 0:
                print("Games Finished", finished_games, "total_reward", total_reward)
            total_reward = 0
        else:
            state = next_state

    # Reflect on training data
    reflect(model, states, actions, rewards, dones)

Games Finished 50 total_reward 18.0
Games Finished 100 total_reward 20.0
Games Finished 150 total_reward 74.0
Games Finished 200 total_reward 18.0
Games Finished 250 total_reward 12.0
Games Finished 300 total_reward 45.0
Games Finished 350 total_reward 39.0
Games Finished 400 total_reward 60.0
Games Finished 450 total_reward 69.0
Games Finished 500 total_reward 18.0
Games Finished 550 total_reward 20.0
Games Finished 600 total_reward 35.0
Games Finished 650 total_reward 36.0
Games Finished 700 total_reward 34.0
Games Finished 750 total_reward 119.0
Games Finished 800 total_reward 55.0
Games Finished 850 total_reward 86.0
Games Finished 900 total_reward 51.0
Games Finished 950 total_reward 97.0
Games Finished 1000 total_reward 88.0
Games Finished 1050 total_reward 18.0
Games Finished 1100 total_reward 83.0
Games Finished 1150 total_reward 45.0
Games Finished 1200 total_reward 106.0
Games Finished 1250 total_reward 102.0
Games Finished 1300 total_reward 29.0
Games Finished 1350 total_rew

### Evaluate agents performance

In [7]:
episodes = 100

total_reward = 0
for _ in range(episodes):
    state = env.reset()
    done = False
    while not done:
        action_probability_distribution = sess.run(model.action_prob, feed_dict = {
            model.inputs : state.reshape([-1, state_size])
        })
        action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())
        state, reward, done, _ = env.step(action)
        total_reward += reward

print(f"Results after {episodes} episodes:")
print(f"Average Reward per episode: {total_reward / episodes}")

Results after 100 episodes:
Average Reward per episode: 177.18


In [8]:
sess.close()