# Chapter 15: Introduction to the Actor-Critic Method

In the last three chapters, we have learned two different kinds of reinforcement learning methods: value-based approaches such as tabular Q-learning and deep Q-learning, and a policy-based approach (namely, policy gradients). In this chapter, we'll introduce a method that combines the two: the actor-critic method.

In the actor-critic method, we simultaneously learn a policy function and a value function. Specifically, we create a neural network with one input network and two parallel output networks that share the same input network. One of the output networks, the actor, tells the agent how to make decisions: it returns a probability distribution over the possible next moves. The other output network, the critic, estimates the expected game outcome.

***
$\mathbf{\text{Create a subfolder for files in Chapter 15}}$<br>
***
We'll put all files in Chapter 15 in a subfolder /files/ch15. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch15", exist_ok=True)

# 1. The Idea behind the Actor-Critic Approach
This section introduces the logic behind the actor-critic approach.

## 1.1. Pros and Cons of Policy-Based and Value-Based Approaches
We discussed the idea behind policy gradients in Chapter 14. In a nutshell, to maximize cumulative rewards, the agent adjusts the model parameters in proportion to the product of the gradients and the rewards. The product of the gradients and the rewards is as follows:
$$
E\left[\sum\limits_{t=0}^{T-1} \nabla _{\theta }\log\pi _{\theta }(a_{t}|s_{t})\,r_t(s_{0},a_{0},\ldots ,a_{T-1},s_T)\,\Big|\,\pi _{\theta }\right]
$$
The discounted reward the agent receives at time t, $r_t(s_{0},a_{0},\ldots ,a_{T-1},s_T)$, is calculated ex post, after observing the final payoff R and the trajectory of states and actions $(s_{0},a_{0},\ldots ,a_{T-1},s_T)$. The advantage of the policy-based approach is that it's very direct and works fast in many situations. The concern is that the gradients have high variance and the training process may become unstable. In some scenarios, the parameters don't converge.

The value-based approach, on the other hand, tries to estimate the value of each state-action pair. The agent then chooses the best action in each state by picking the action with the highest value. The approach is stable, but training is more time-consuming.

The actor-critic approach combines the value-based approach and the policy-based approach and provides a more powerful reinforcement learning method.

## 1.2. The Concept of Advantage
Actor-critic reinforcement learning models have two networks: a value-based network that we call the critic, and a policy-based network that we call the actor. The value network estimates the expected game outcome (that is, the value of the action produced by the actor), while the policy network tells the agent which actions to take.

Instead of adjusting the parameters by the product of the gradients and the rewards, the actor-critic approach adjusts the parameters by the product of the gradients and the advantage. The advantage is defined as the difference between the reward and the expected outcome:
$$A(s_t,a_t)=r_t(s_{0},a_{0},\ldots ,a_{T-1},s_T)-V(s_t)$$

The logic is as follows: if the agent expects to win the game (i.e., the value of $V(s_t)$ is high), and the agent eventually wins the game, then the move made by the agent at time t, $a_t$, may not be that valuable. Therefore, the agent shouldn't adjust the model parameters that much. On the other hand, if the agent expects to lose the game (i.e., the value of $V(s_t)$ is low), and the agent surprisingly wins the game, the move made by the agent at time t, $a_t$, must be a really good move. Therefore, the agent should adjust the model parameters so that such valuable moves are rewarded. This is a more efficient way of adjusting the model parameters to produce intelligent agents.
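To make the definition concrete, here is a minimal sketch with made-up numbers. The rewards and value estimates below are hypothetical and do not come from a trained model; the helper name `advantage` is also just for illustration. It compares an expected win with a surprise win.

```python
# Hypothetical numbers for illustration only
def advantage(discounted_reward, value_estimate):
    """A(s_t, a_t) = r_t - V(s_t)."""
    return discounted_reward - value_estimate

# The agent expected to win (V close to 1) and did win (reward 1)
expected_win = advantage(1.0, 0.9)
# The agent expected to lose (V close to -1) but won anyway (reward 1)
surprise_win = advantage(1.0, -0.8)

print(f"advantage after an expected win: {expected_win:.2f}")
print(f"advantage after a surprise win:  {surprise_win:.2f}")
```

The surprise win produces a much larger advantage, so the corresponding move receives a much larger parameter update.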

# 2. Use Actor-Critic to Play the Coin Game

You'll now learn how to play the coin game using the actor-critic approach. Specifically, we'll adjust the parameters in proportion to the product of the gradients and the advantage.

## 2.1. Create the Model with Two Networks


In [2]:
from tensorflow import keras
from tensorflow.keras import layers

num_inputs = 22
num_actions = 2
# The input layer
inputs = layers.Input(shape=(num_inputs,))
# The common layer
common = layers.Dense(32, activation="relu")(inputs)
common = layers.Dense(32, activation="relu")(common)
# The policy network
action = layers.Dense(num_actions, activation="softmax")(common)
# The value network
critic = layers.Dense(1, activation="tanh")(common)
model = keras.Model(inputs=inputs, outputs=[action, critic])

The input shape is 22 since we'll convert the game state into a one-hot variable with a depth of 22 (since the number of coins left in the pile can range from 0 to 21). There are two possible actions for the agent: the agent can take either 1 or 2 coins from the pile in each turn. 

The value network and the policy network share a common network, but the model has two output networks. The action network has two neurons in it since there are two possible actions that the agent can take. We use the softmax activation function so that the two values in the action network add up to one. We can interpret the two values as the probabilities of taking actions 1 and 2, respectively.

The value network has one neuron in it. The output is the expected outcome of the game for the current player. We use the tanh activation function in the value network so that the output is a value between -1 and 1. We can interpret a value of -1 as an expectation that the current player will lose the game, while a value of 1 indicates that the current player will win the game.
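The effect of the two activation functions can be checked without TensorFlow. The sketch below is only an illustration: the raw (pre-activation) outputs `actor_logits` and `critic_raw` are hypothetical numbers, and `softmax` is a hand-written helper rather than the Keras layer itself.

```python
import math

def softmax(logits):
    # subtract the largest value for numerical stability
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of the two heads before activation
actor_logits = [0.3, 1.2]
critic_raw = 0.8

# The actor head: two probabilities that add up to one
action_probs = softmax(actor_logits)
# The critic head: a single number between -1 and 1
critic_value = math.tanh(critic_raw)

print(action_probs, sum(action_probs))
print(critic_value)
```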

In [3]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
loss_func = keras.losses.MeanAbsoluteError()

## 2.2. Calculate Gradients and Discounted Rewards
We'll let the agent play against the rule-based AI that we developed in Chapter 1. For that purpose, we first define a rule_based_AI() function as follows:

In [4]:
import random

def rule_based_AI(state):
    if state%3 != 0:
        move = state%3
    else:
        move = random.choice([1,2])
    return move
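The rule above always tries to leave the opponent a pile that is a multiple of three. The sketch below simulates why that works. It does not use the `coin_game` environment; instead it assumes that the game starts with 21 coins and that whoever takes the last coin wins. The `simulate()` and `random_player()` helpers are hypothetical names introduced only for this illustration.

```python
import random

def rule_based_AI(state):
    # same rule as above: leave a multiple of 3 whenever possible
    if state % 3 != 0:
        return state % 3
    return random.choice([1, 2])

def random_player(state):
    # never take more coins than are left in the pile
    return 1 if state == 1 else random.choice([1, 2])

def simulate(first_player, second_player, coins=21):
    """Return 1 if the first player takes the last coin, 2 otherwise."""
    players = [first_player, second_player]
    turn = 0
    while True:
        coins -= players[turn](coins)
        if coins == 0:
            return turn + 1
        turn = 1 - turn

# The rule-based AI moves second against a random first player
winners = [simulate(random_player, rule_based_AI) for _ in range(1000)]
second_player_wins = winners.count(2)
print(f"second player won {second_player_wins} of 1000 games")
```

Under these assumptions the first player always faces a multiple of three, so the rule-based AI wins every game when it moves second.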

We also define a onehot_encoder() function to convert the state to a one-hot variable with a depth of 22, as follows:

In [5]:
import numpy as np

def onehot_encoder(state):
    onehot=np.zeros((1,22))
    onehot[0,state]=1
    return onehot
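As a quick check of the encoder, the sketch below repeats the function so that it runs on its own and encodes a pile of 5 coins (an arbitrary example state).

```python
import numpy as np

def onehot_encoder(state):
    onehot = np.zeros((1, 22))
    onehot[0, state] = 1
    return onehot

# Encode a state with 5 coins left in the pile
encoded = onehot_encoder(5)
print(encoded.shape)             # batch dimension of 1, depth of 22
print(int(np.argmax(encoded)))   # position of the single 1
```

The leading dimension of 1 is the batch dimension that the Keras model expects.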

The playing() function simulates a full game, with the rule-based AI as the first player, and the actor-critic (AC) agent as the second player, as follows:

In [6]:
from utils.coin_simple_env import coin_game
import tensorflow as tf

env=coin_game()
def playing():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    # Rule-based AI moves first
    state, reward, done, _ = env.step(rule_based_AI(state))
    while True:
        # convert state to onehot to feed to model
        onehot_state = onehot_encoder(state)
        # estimate action probabilities and future rewards
        action_probs, critic_value = model(onehot_state)
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # select action based on the policy network
        action=np.random.choice(num_actions, p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(tf.math.log(action_probs[0, action]))
        # Apply the sampled action in our environment
        # Remember to add 1 to action since the actual actions are 1 and 2
        state, reward, done, _ = env.step(action+1)
        if done:
            # Since AC player is player 2, -1 means AC player wins
            reward = -reward
            rewards_history.append(reward)
            episode_reward += reward 
            break
        else:
            state, reward, done, _ = env.step(rule_based_AI(state))
            reward = -reward
            rewards_history.append(reward)
            episode_reward += reward                 
            if done:
                break
    return action_probs_history,critic_value_history, \
            rewards_history, episode_reward

The playing() function simulates a full game and records all the intermediate steps made by the AC agent. The function returns four values: 
* a list action_probs_history with the natural logarithm of the probability that the policy network assigned to each action taken by the agent; 
* a list critic_value_history with the estimated future rewards from the value network; 
* a list rewards_history with the reward for each action taken by the agent in the game;
* a number episode_reward showing the total reward to the agent during the game. 

In short, the playing() function plays a full game and records the log probabilities, value estimates, and rewards that we'll need later to calculate the gradients.

In reinforcement learning, actions affect not only current period rewards, but also future rewards. We therefore use discounted rewards to assign credits properly as follows:

In [7]:
import numpy as np

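def discount_rs(r):
    # gamma is the discount factor; it is defined as a global
    # variable in the training cell before this function is called
    discounted_rs = np.zeros(len(r))
    running_add = 0
    # work backward from the last reward to the first
    for i in reversed(range(0, len(r))):
        running_add = gamma*running_add + r[i]
        discounted_rs[i] = running_add  
    return discounted_rs.tolist()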
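A small worked example may help. The sketch below is a pure-Python variant of the function that takes gamma as an argument (an assumption made here so the example is self-contained; the version above reads the global gamma). The reward sequence is hypothetical: the agent makes three moves and only the last one is rewarded.

```python
def discount_rs(r, gamma=0.95):
    discounted_rs = [0.0] * len(r)
    running_add = 0.0
    # work backward from the last reward to the first
    for i in reversed(range(len(r))):
        running_add = gamma * running_add + r[i]
        discounted_rs[i] = running_add
    return discounted_rs

# Three moves by the agent; only the last one is rewarded
returns = discount_rs([0, 0, 1])
print([round(x, 4) for x in returns])  # [0.9025, 0.95, 1.0]
```

Earlier moves receive credit for the final win, but less of it the further they are from the end of the game.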

## 2.3. Update Parameters
Instead of updating the model parameters after every episode, we update them after a certain number of episodes to make training more stable. Here we update the parameters every ten games, as follows:

In [8]:
batch_size=10
def create_batch(batch_size):
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,cvs,rs,er = playing()
        returns = discount_rs(rs)
        action_probs_history += aps
        critic_value_history += cvs
        rewards_history += returns
        episode_rewards.append(er)
    return action_probs_history,critic_value_history,\
        rewards_history,episode_rewards

In [9]:
from collections import deque

running_rewards=deque(maxlen=100)
gamma = 0.95  
episode_count = 0
# Train the model
while True:
    with tf.GradientTape() as tape:
        action_probs_history,critic_value_history,\
            rewards_history,episode_rewards=create_batch(batch_size)
        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, rewards_history)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # Calculate actor loss
            diff = ret - value
            actor_losses.append(-log_prob * diff)
            # Calculate critic loss
            critic_losses.append(
                loss_func(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )
        # Adjust model parameters
        loss_value = sum(actor_losses) + sum(critic_losses)
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Log details
    episode_count += batch_size
    for r in episode_rewards:
        running_rewards.append(r)
    running_reward=np.mean(np.array(running_rewards)) 
    # print out progress
    if episode_count % 100 == 0:
        template = "running reward: {:.6f} at episode {}"
        print(template.format(running_reward, episode_count))   
    # Stop if the game is solved
    if running_reward > 0.999 and episode_count>100:  
        print("Solved at episode {}!".format(episode_count))
        break
model.save("files/ch15/AC_coin.h5")

running reward: -1.000000 at episode 100
running reward: -0.980000 at episode 200
running reward: -0.920000 at episode 300
running reward: -0.960000 at episode 400
running reward: -0.900000 at episode 500
running reward: -0.880000 at episode 600
running reward: -0.260000 at episode 700
running reward: 0.800000 at episode 800
running reward: 0.900000 at episode 900
running reward: 0.960000 at episode 1000
running reward: 1.000000 at episode 1100
Solved at episode 1100!


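Note that here we adjust the parameters by the product of the gradients and the advantage (the variable diff in the code above, which is the discounted reward minus the critic's value estimate), rather than by the product of the gradients and the discounted rewards alone. This is related to the solution to the rewards maximization problem.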
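To see what the two loss terms look like for a single recorded step, here is a minimal sketch in plain Python. All three input numbers are hypothetical, and the absolute difference stands in for what the mean absolute error loss computes when it is applied to a single number.

```python
import math

# Hypothetical values for one recorded step
prob_of_action = 0.6   # probability the actor assigned to the chosen move
value = 0.2            # the critic's estimate of the game outcome
ret = 0.95             # the discounted reward actually received

log_prob = math.log(prob_of_action)
# The advantage: how much better the outcome was than expected
diff = ret - value
# Actor loss: negative log probability weighted by the advantage
actor_loss = -log_prob * diff
# Critic loss: absolute error between the estimate and the outcome
critic_loss = abs(value - ret)
step_loss = actor_loss + critic_loss

print(f"advantage: {diff:.4f}")
print(f"actor loss: {actor_loss:.4f}, critic loss: {critic_loss:.4f}")
```

In the training loop, these per-step losses are summed over all steps in the batch before the gradients are computed.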

# 3. Play Games with the Trained Model

We'll use the trained model to play against the rule-based AI. We'll use two strategies: a deterministic policy and a stochastic policy. With a deterministic policy, the agent selects the move with the highest probability. With a stochastic policy, the agent selects moves randomly, according to the probabilities produced by the policy network.
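The difference between the two strategies can be sketched with the standard library alone. The probabilities below are hypothetical stand-ins for the output of the policy network in one state; they are not taken from the trained model.

```python
import random

# Hypothetical output of the policy network for one state
action_probs = [0.97, 0.03]
moves = [1, 2]

# Deterministic policy: always take the most likely move
deterministic_move = moves[action_probs.index(max(action_probs))]

# Stochastic policy: sample a move according to the probabilities
random.seed(0)
samples = random.choices(moves, weights=action_probs, k=10_000)
share_of_move_1 = samples.count(1) / len(samples)

print(f"deterministic move: {deterministic_move}")
print(f"share of move 1 under the stochastic policy: {share_of_move_1:.3f}")
```

When the policy network is very confident, the two strategies behave almost identically; they diverge only in states where the probabilities are close to each other.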

## 3.1. Deterministic Policy from the Trained Model
The AC agent plays 100 games against the rule-based AI using a deterministic policy:

In [10]:
results=[]
for i in range(100):
    env = coin_game()
    state=env.reset()     
    while True:    
        action = rule_based_AI(state)  
        state, reward, done, info = env.step(action)
        if done:
            # record -1 if rule-based AI player won
            results.append(-1)
            break
        # estimate action probabilities and future rewards
        onehot_state = onehot_encoder(state)
        action_probs, _ = model(onehot_state)
        # select action with the highest probability
        action=np.argmax(action_probs[0])+1
        state, reward, done, info = env.step(action)
        if done:
            # record 1 if AC agent won
            results.append(1)            
            break            

We can count how many times the AC agent has won:

In [11]:
# Print out the number of games that AC won
wins=results.count(1)
print(f"The AC player has won {wins} games")
# Print out the number of games that AC lost
losses=results.count(-1)
print(f"The AC player has lost {losses} games")  

The AC player has won 100 games
The AC player has lost 0 games


The AC agent plays perfectly and wins all 100 games.

## 3.2. Stochastic Policy from the Trained Model
What if the AC player uses a stochastic policy? That is, the agent will randomly choose actions, with probabilities recommended by the policy network from the trained model. We'll test such a strategy next. The AC player uses a stochastic policy to play 100 games against the rule-based AI: 

In [12]:
results=[]
for i in range(100):
    env = coin_game()
    state=env.reset()     
    while True:    
        action = rule_based_AI(state)  
        state, reward, done, info = env.step(action)
        if done:
            # record -1 if rule-based AI player won
            results.append(-1)
            break
        # estimate action probabilities and future rewards
        onehot_state = onehot_encoder(state)
        action_probs, _ = model(onehot_state)
        # sample an action according to the policy probabilities
        action=np.random.choice(num_actions,p=np.squeeze(action_probs))+1
        state, reward, done, info = env.step(action)
        if done:
            # record 1 if AC agent won
            results.append(1)            
            break            

We can count how many times the AC agent has won:

In [13]:
# Print out the number of games that AC won
wins=results.count(1)
print(f"The AC player has won {wins} games")
# Print out the number of games that AC lost
losses=results.count(-1)
print(f"The AC player has lost {losses} games")  

The AC player has won 100 games
The AC player has lost 0 games


The AC agent has won all 100 games as well!

# 4. What If the AC Agent Moves First?
So far, we have trained the AC agent against a perfect rule-based AI player. If both sides play perfectly, the second player always wins, so we let the AC agent move second to see if it can learn to play perfectly. Based on the results from the last section, the answer is yes.
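The claim that the second player wins under perfect play can be checked with a short backward-induction sketch. It assumes, as the rule-based strategy suggests, that players alternate taking 1 or 2 coins from a pile of 21 and that whoever takes the last coin wins; these rules are not read from the `coin_game` environment itself.

```python
# Assumption: players alternate taking 1 or 2 coins, and whoever
# takes the last coin wins the game.
MAX_COINS = 21

# wins[n] is True if the player about to move with n coins left
# can force a win; with 0 coins left, the player to move has lost
wins = [False] * (MAX_COINS + 1)
for n in range(1, MAX_COINS + 1):
    # a position is winning if some legal move leaves the
    # opponent in a losing position
    wins[n] = any(not wins[n - take] for take in (1, 2) if take <= n)

losing_positions = [n for n in range(1, MAX_COINS + 1) if not wins[n]]
print(losing_positions)  # [3, 6, 9, 12, 15, 18, 21]
```

Since 21 is a losing position for the player who moves first, a perfect second player can always win.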

Now, what if the AC agent moves first? The opponent cannot be a perfect player, so we select a random player to play against the AC agent. We need to change a few things.

## 4.1. Define a playing_first() Function
The playing_first() function simulates a full game, with the AC agent as the first player, and the random player as the second player, as follows:

In [14]:
def playing_first():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    while True:
        # convert state to onehot to feed to model
        onehot_state = onehot_encoder(state)
        # estimate action probabilities and future rewards
        action_probs, critic_value = model(onehot_state)
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # select action based on the policy network
        action=np.random.choice(num_actions, p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(tf.math.log(action_probs[0, action]))
        # Apply the sampled action in our environment
        # Remember to add 1 to action since the actual actions are 1 and 2
        state, reward, done, _ = env.step(action+1)
        if done:
            rewards_history.append(reward)
            episode_reward += reward 
            break
        else:
            state, reward, done, _ = env.step(random.choice([1,2]))
            rewards_history.append(reward)
            episode_reward += reward                 
            if done:
                break
    return action_probs_history,critic_value_history, \
            rewards_history, episode_reward

Like playing(), the playing_first() function simulates a full game and records all the intermediate steps made by the AC agent. The function returns the same four values: 
* a list action_probs_history with the natural logarithm of the probability that the policy network assigned to each action taken by the agent; 
* a list critic_value_history with the estimated future rewards from the value network; 
* a list rewards_history with the reward for each action taken by the agent in the game;
* a number episode_reward showing the total reward to the agent during the game. 

As before, we convert the rewards into discounted rewards with the discount_rs() function defined in Section 2.2, so that earlier moves receive proper credit for the final outcome.

## 4.2. Update Parameters
As in Section 2.3, instead of updating the model parameters after every episode, we update them after a certain number of episodes to make training more stable. Here we again update the parameters every ten games, as follows:

In [15]:
batch_size=10
def create_batch_first(batch_size):
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,cvs,rs,er = playing_first()
        returns = discount_rs(rs)
        action_probs_history += aps
        critic_value_history += cvs
        rewards_history += returns
        episode_rewards.append(er)
    return action_probs_history,critic_value_history,\
        rewards_history,episode_rewards

In [16]:
from collections import deque

running_rewards=deque(maxlen=100)
gamma = 0.95  
episode_count = 0
# Train the model
while True:
    with tf.GradientTape() as tape:
        action_probs_history,critic_value_history,\
            rewards_history,episode_rewards=create_batch_first(batch_size)
        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, rewards_history)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # Calculate actor loss
            diff = ret - value
            actor_losses.append(-log_prob * diff)
            # Calculate critic loss
            critic_losses.append(
                loss_func(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )
        # Adjust model parameters
        loss_value = sum(actor_losses) + sum(critic_losses)
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Log details
    episode_count += batch_size
    for r in episode_rewards:
        running_rewards.append(r)
    running_reward=np.mean(np.array(running_rewards)) 
    # print out progress
    if episode_count % 100 == 0:
        template = "running reward: {:.6f} at episode {}"
        print(template.format(running_reward, episode_count))   
    # Stop if the game is solved
    if running_reward > 0.999 and episode_count>100:  
        print("Solved at episode {}!".format(episode_count))
        break
model.save("files/ch15/AC_coin_first.h5")

running reward: 1.000000 at episode 100
Solved at episode 110!


As in Section 2.3, we adjust the parameters by the product of the gradients and the advantage.

# 5. Play Games with the New Trained Model

We'll use the newly trained model to play against the random player. We'll again use two strategies: a deterministic policy and a stochastic policy. With a deterministic policy, the agent selects the move with the highest probability. With a stochastic policy, the agent selects moves randomly, according to the probabilities produced by the policy network.

## 5.1. Deterministic Policy from the Trained Model
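The AC agent plays 100 games against the random player using a deterministic policy. Note that in the test below, the random player makes the first move: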

In [17]:
results=[]
for i in range(100):
    env = coin_game()
    state=env.reset()     
    while True:    
        action = random.choice([1,2])  
        state, reward, done, info = env.step(action)
        if done:
            # record -1 if the random player won
            results.append(-1)
            break
        # estimate action probabilities and future rewards
        onehot_state = onehot_encoder(state)
        action_probs, _ = model(onehot_state)
        # select action with the highest probability
        action=np.argmax(action_probs[0])+1
        state, reward, done, info = env.step(action)
        if done:
            # record 1 if AC agent won
            results.append(1)            
            break            

We can count how many times the AC agent has won:

In [18]:
# Print out the number of games that AC won
wins=results.count(1)
print(f"The AC player has won {wins} games")
# Print out the number of games that AC lost
losses=results.count(-1)
print(f"The AC player has lost {losses} games")  

The AC player has won 100 games
The AC player has lost 0 games


The AC player still plays perfectly. 

## 5.2. Stochastic Policy from the Trained Model
The AC agent plays 100 games against the random player, this time using a stochastic policy:

In [19]:
results=[]
for i in range(100):
    env = coin_game()
    state=env.reset()     
    while True:    
        action = random.choice([1,2])  
        state, reward, done, info = env.step(action)
        if done:
            # record -1 if the random player won
            results.append(-1)
            break
        # estimate action probabilities and future rewards
        onehot_state = onehot_encoder(state)
        action_probs, _ = model(onehot_state)
        # sample an action according to the policy probabilities
        action=np.random.choice(num_actions,p=np.squeeze(action_probs))+1
        state, reward, done, info = env.step(action)
        if done:
            # record 1 if AC agent won
            results.append(1)            
            break            

We can count how many times the AC agent has won:

In [20]:
# Print out the number of games that AC won
wins=results.count(1)
print(f"The AC player has won {wins} games")
# Print out the number of games that AC lost
losses=results.count(-1)
print(f"The AC player has lost {losses} games")  

The AC player has won 99 games
The AC player has lost 1 games


The AC player has won 99 out of 100 games. 