# Chapter 19: The Actor-Critic Method and AlphaZero

***
***“As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.”***

-- David Silver and Demis Hassabis, 2017
***



In 2017, the DeepMind team introduced an advanced version of AlphaGo, named AlphaGo Zero (which we'll refer to as AlphaZero in this book because we apply the algorithm to various games beyond Go). This version marked a significant departure from its predecessor, the AlphaGo that defeated world Go champion Lee Sedol in 2016. Unlike the earlier version, AlphaGo Zero was trained exclusively through deep reinforcement learning: it was given only the basic rules of the game, with no human-derived strategies or domain-specific knowledge, and it learned from scratch through self-play. This matters beyond games. On many complex real-world problems, human expertise and domain knowledge are limited, and the challenge is for AI to find robust solutions under those constraints. The success of AlphaGo Zero suggests that this is possible.

You may be curious about what makes AlphaZero more powerful than AlphaGo. A key factor is AlphaZero's use of larger and more complex neural networks, which significantly boost its learning ability. For the simpler games covered in this book, however, the additional layers add little benefit, so we won't explore this aspect of AlphaZero.

Another element that has elevated AlphaZero above its predecessors is a more advanced deep reinforcement learning technique. As explored in the previous three chapters, the AlphaGo algorithm first trains two policy networks using data from human experts. It then uses self-play deep reinforcement learning to improve one of these policy networks, creating a policy gradient network. It also uses game experience data from self-play to train a value network for predicting game outcomes. In contrast, AlphaZero uses a single neural network with two outputs: a *policy head* for predicting the next move and a *value head* for forecasting game results, as highlighted by David Silver and Demis Hassabis in this chapter's introductory quote.

In this chapter, we'll delve into the advanced deep reinforcement learning strategy known as the actor-critic method, applying it specifically to the coin game. The concept of the *actor* in this method is straightforward: it's the game player who must determine the most advantageous next move. This actor is essentially the policy network we've examined in the policy-gradient methods from Chapters 13 to 15.

The *critic* plays a different role, akin to another player who evaluates the moves made so far. The critic's feedback is used to refine future moves. To put this into practice, we add a second output to the deep neural network, giving it two heads: a policy network for predicting the next move and a value network for estimating the game's outcome. The policy network takes the actor's role and selects the next move, while the value network, as the critic, tracks the agent's progress throughout the game. This feedback is essential for training, much as reviewing a finished game improves your own play.

In this chapter, you'll learn how to build an actor-critic (AC) deep reinforcement learning model for the coin game. The model has two concurrent outputs: the policy network, serving as the actor, and the value network, serving as the critic. You'll then learn how to train the model. Unlike our approach with the AlphaGo algorithm, where rule-based AI players served as experts that generated training moves, here we start from scratch and rely solely on self-play, without expert moves as a guide.

Once you've trained the AC deep reinforcement model, you'll combine its policy and value networks with Monte Carlo Tree Search (MCTS) to make decisions in actual games, mirroring the approach we took with the AlphaGo algorithm in Chapter 16. You'll see that the resulting AlphaZero agent plays flawlessly and solves the game: playing as Player 2, it beats the AlphaGo agent we developed in Chapter 16 in all ten test games.

# 1. The Actor-Critic Method
 

# 2. An Overview of the Training Process


## 2.1. Steps to Train An AlphaZero Agent


## 2.2. An Actor-Critic Agent for the Coin Game


In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_inputs = 22
num_actions = 2
# The input layer: a one-hot vector of length 22
inputs = layers.Input(shape=(num_inputs,))
# The common layers, shared by the actor and the critic
common = layers.Dense(32, activation="relu")(inputs)
common = layers.Dense(32, activation="relu")(common)
# The policy network (actor): a probability for each action
action = layers.Dense(num_actions, activation="softmax")(common)
# The value network (critic): predicted outcome between -1 and 1
critic = layers.Dense(1, activation="tanh")(common)
model = keras.Model(inputs=inputs, outputs=[action, critic])

In [2]:
optimizer = keras.optimizers.Adam(learning_rate=0.001)
loss_func = keras.losses.MeanAbsoluteError()

# 3. Train the Actor-Critic Agent in the Coin Game


## 3.1. Train the Two Players in the Actor-Critic Model


In [3]:
import numpy as np

def onehot_encoder(state):
    onehot=np.zeros((1,22))
    onehot[0,state]=1
    return onehot

def ACplayer(env): 
    # estimate action probabilities; the critic value is not needed here
    onehot_state = onehot_encoder(env.state)
    action_probs, _ = model(onehot_state)
    # sample an action from the policy's probability distribution
    action=np.random.choice([1,2],p=np.squeeze(action_probs))
    return action
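
During self-play the agent samples its move from the policy's probabilities, which keeps it exploring, whereas a greedy agent always takes the most likely move. The sketch below contrasts the two using a made-up probability vector and only the standard library; `random.choices` stands in for `np.random.choice`, and the helper names `sample_move` and `greedy_move` are introduced here for illustration.

```python
import random

# hypothetical policy output for one state: P(take 1 coin), P(take 2 coins)
action_probs = [0.3, 0.7]
moves = [1, 2]

def sample_move(probs, rng):
    # stochastic choice, as in self-play: each move is drawn with its probability
    return rng.choices(moves, weights=probs, k=1)[0]

def greedy_move(probs):
    # deterministic choice: always the most likely move
    return moves[probs.index(max(probs))]

rng = random.Random(0)
samples = [sample_move(action_probs, rng) for _ in range(10_000)]
share_of_2 = samples.count(2) / len(samples)
print(f"sampled share of move 2: {share_of_2:.2f}")
print(f"greedy move: {greedy_move(action_probs)}")
```

Sampling picks move 2 roughly 70% of the time here, while the greedy rule picks it every time.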

In [4]:
from utils.coin_simple_env import coin_game
env=coin_game()
def playing_1():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    state = env.reset()
    while True:
        # estimate action probabilities and future rewards
        onehot_state = onehot_encoder(env.state)
        action_probs, critic_value = model(onehot_state)
        # select action based on policy network
        action=np.random.choice([1,2],p=np.squeeze(action_probs))
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # record the log probability of the chosen action
        action_probs_history.append(
            tf.math.log(action_probs[0, action-1]))
        # Apply the sampled action in our environment
        state, reward, done, _ = env.step(action)
        if done:
            rewards_history.append(reward)
            break
        else:
            state, reward, done, _ = env.step(ACplayer(env))
            rewards_history.append(reward)               
            if done:
                break                
    return action_probs_history,critic_value_history,\
        rewards_history

In [5]:
def discount_rs(r):
    # gamma, the discount rate, is a global set alongside train_player()
    discounted_rs = np.zeros(len(r))
    running_add = 0
    for i in reversed(range(0, len(r))):
        running_add = gamma*running_add + r[i]
        discounted_rs[i] = running_add  
    return discounted_rs.tolist()
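
To see what `discount_rs()` produces, here is a small standalone version with `gamma` passed as an argument (an illustrative change; the function above reads the global `gamma`). When only the final step carries a reward, each earlier step receives that reward discounted once per step of distance.

```python
def discount_rewards(rewards, gamma=0.95):
    # walk backward so each step accumulates the discounted future rewards
    discounted = [0.0] * len(rewards)
    running_add = 0.0
    for i in reversed(range(len(rewards))):
        running_add = gamma * running_add + rewards[i]
        discounted[i] = running_add
    return discounted

# a three-step game in which only the last step is rewarded
returns = discount_rewards([0, 0, 1])
print([round(g, 4) for g in returns])   # [0.9025, 0.95, 1.0]
```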

In [6]:
def playing_2():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    state = env.reset()
    state, reward, done, _ = env.step(ACplayer(env))
    while True:
        # estimate action probabilities and future rewards
        onehot_state = onehot_encoder(env.state)
        action_probs, critic_value = model(onehot_state)
        # select action based on policy network
        action=np.random.choice([1,2],p=np.squeeze(action_probs))
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # record the log probability of the chosen action
        action_probs_history.append(
            tf.math.log(action_probs[0, action-1]))
        # Apply the sampled action in our environment
        state, reward, done, _ = env.step(action)
        # negate rewards so they are from the second player's point of view
        if done:
            rewards_history.append(-reward)
            break
        else:
            state, reward, done, _ = env.step(ACplayer(env))
            rewards_history.append(-reward)
            if done:
                break                
    return action_probs_history,critic_value_history, \
            rewards_history 

## 3.2. Update Parameters During Training


In [7]:
batch_size=10
def create_batch(playing_func):
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    for i in range(batch_size):
        aps,cvs,rs = playing_func()
        # rewards are discounted
        returns = discount_rs(rs)
        action_probs_history += aps
        critic_value_history += cvs
        rewards_history += returns       
    return action_probs_history,critic_value_history,\
        rewards_history

In [8]:
gamma=0.95
def train_player(playing_func):
    # Train the model for one batch (ten games)
    with tf.GradientTape() as tape:
        action_probs_history,critic_value_history,\
            rewards_history=create_batch(playing_func)
        # convert the game histories to tensors
        returns=tf.convert_to_tensor(rewards_history,dtype=tf.float32)
        values=tf.convert_to_tensor(critic_value_history,dtype=tf.float32)
        log_probs=tf.convert_to_tensor(action_probs_history,dtype=tf.float32)
        # difference between discounted returns and the critic's estimates
        tfdif=returns-values
        # actor loss: -log(probability of chosen action) * difference
        alosses=-tf.multiply(log_probs,tfdif)
        # critic loss: mean absolute error between returns and estimates
        closs=loss_func(returns,values)
        # total loss
        loss_value = tf.reduce_sum(alosses) + closs
    # Backpropagation
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads,model.trainable_variables))
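
The loss that `train_player()` minimizes can be checked by hand on a tiny batch. The sketch below recomputes it in plain Python for two time steps. The actor term is the negative log probability of the chosen move times the gap between the discounted return and the critic's estimate, summed over steps. The critic term is the mean absolute error between returns and estimates. All numbers are invented for illustration.

```python
import math

# made-up values for two time steps of one game
chosen_probs = [0.5, 0.8]     # policy probability of the move actually taken
critic_values = [0.2, 0.6]    # critic's predicted outcome at each step
returns = [0.95, 1.0]         # discounted returns (gamma = 0.95, final reward 1)

log_probs = [math.log(p) for p in chosen_probs]
# gap between what happened and what the critic expected
advantages = [g - v for g, v in zip(returns, critic_values)]
# actor loss: summed over steps, mirroring tf.reduce_sum(alosses)
actor_loss = sum(-lp * adv for lp, adv in zip(log_probs, advantages))
# critic loss: mean absolute error, mirroring MeanAbsoluteError
critic_loss = sum(abs(g - v) for g, v in zip(returns, critic_values)) / len(returns)
total_loss = actor_loss + critic_loss
print(f"actor {actor_loss:.4f}, critic {critic_loss:.4f}, total {total_loss:.4f}")
```

A move that turned out better than the critic expected has a positive gap, so lowering the loss raises that move's probability.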

## 3.3. The Training Loop


In [9]:
import random

def rule_based_AI(state):
    if state%3 != 0:
        move = state%3
    else:
        move = random.choice([1,2])
    return move

def test():
    results=[]
    for i in range(100):
        env = coin_game()
        state=env.reset()     
        while True:    
            action = rule_based_AI(state)  
            state, reward, done, info = env.step(action)
            if done:
                # record -1 if rule-based AI player won
                results.append(-1)
                break
            # estimate action probabilities and future rewards
            onehot_state = onehot_encoder(state)
            action_probs, _ = model(onehot_state)
            # select action with the highest probability
            action=np.argmax(action_probs[0])+1
            state, reward, done, info = env.step(action)
            if done:
                # record 1 if AC agent won
                results.append(1)            
                break  
    return results.count(1)      
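
The test above has the agent move second against `rule_based_AI()`. A second player who plays perfectly can win every such game, which is why 100 wins out of 100 is a reachable stopping target. The sketch below checks this with a stand-in game loop instead of the `coin_game` environment. It assumes the rules are: 21 coins, each player removes one or two per turn, and whoever takes the last coin wins. The function `play_one_game` is introduced here for illustration.

```python
import random

def rule_based_AI(state):
    # same rule as above: leave the opponent a multiple of three if possible
    if state % 3 != 0:
        return state % 3
    return random.choice([1, 2])

def play_one_game(start=21):
    # assumed rules: take 1 or 2 coins per turn; taking the last coin wins
    state = start
    while True:
        state -= rule_based_AI(state)      # first player moves
        if state == 0:
            return 1
        state -= rule_based_AI(state)      # second player follows the same rule
        if state == 0:
            return 2

random.seed(0)
winners = [play_one_game() for _ in range(1000)]
print("second-player wins:", winners.count(2), "out of", len(winners))
```

Because 21 is a multiple of three, the first player can never leave a multiple of three behind, and the second player can always restore one.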

In [10]:
n = 0
batches=0
# Train the model
while True:
    if batches%2==0:
        train_player(playing_1)
    else:
        train_player(playing_2)
    # Log details
    n += batch_size
    batches += 1
    # print out progress every 100 episodes
    if n % 100 == 0:
        model.save("files/ac_coin.h5")        
        wins=test()
        print(f"at episode {n}, number of wins is {wins}/100")
        if wins==100:
            break     

at episode 100, number of wins is 3/100
at episode 200, number of wins is 3/100
at episode 300, number of wins is 1/100
at episode 400, number of wins is 2/100
at episode 500, number of wins is 4/100
at episode 600, number of wins is 2/100
at episode 700, number of wins is 0/100
at episode 800, number of wins is 0/100
at episode 900, number of wins is 1/100
at episode 1000, number of wins is 2/100
at episode 1100, number of wins is 1/100
at episode 1200, number of wins is 1/100
at episode 1300, number of wins is 2/100
at episode 1400, number of wins is 3/100
at episode 1500, number of wins is 1/100
at episode 1600, number of wins is 1/100
at episode 1700, number of wins is 3/100
at episode 1800, number of wins is 3/100
at episode 1900, number of wins is 4/100
at episode 2000, number of wins is 6/100
at episode 2100, number of wins is 3/100
at episode 2200, number of wins is 4/100
at episode 2300, number of wins is 1/100
at episode 2400, number of wins is 3/100
at episode 2500, number o

at episode 4000, number of wins is 21/100
at episode 4100, number of wins is 27/100
at episode 4200, number of wins is 15/100
at episode 4300, number of wins is 30/100
at episode 4400, number of wins is 24/100
at episode 4500, number of wins is 35/100
at episode 4600, number of wins is 45/100
at episode 4700, number of wins is 54/100
at episode 4800, number of wins is 56/100
at episode 4900, number of wins is 42/100
at episode 5000, number of wins is 56/100
at episode 5100, number of wins is 52/100
at episode 5200, number of wins is 48/100
at episode 5300, number of wins is 100/100


Training stops after 5,300 episodes, when the agent wins all 100 test games against the rule-based AI. Alternatively, you can download the trained model from the book's GitHub repository.

# 4. An AlphaZero Agent in the Coin Game



## 4.1. Select, Expand, Simulate, and Backpropagate
In the local module *ch19util*, we first load up the trained actor-critic model, as follows:

```python
from copy import deepcopy
import numpy as np
from tensorflow import keras

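# Load the trained actor critic model
model=keras.models.load_model("files/ac_coin.h5")

# The functions below also need the one-hot encoder from Section 3.1
def onehot_encoder(state):
    onehot=np.zeros((1,22))
    onehot[0,state]=1
    return onehot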
```

```python
def select(priors,env,results,weight):    
    # weighted average of priors and rollout_value
    scores={}
    for k,v in results.items():
        # rollout_value for each next move
        if len(v)==0:
            vi=0
        else:
            vi=sum(v)/len(v)
        # scale the prior by (1+N(L))
        prior=priors[0][k-1]/(1+len(v))
        # calculate weighted average
        scores[k]=weight*prior+(1-weight)*vi
    # select child node based on the weighted average     
    return max(scores,key=scores.get)  
```
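
A short worked example may help show how `select()` trades off the prior against rollout results. The sketch below repeats the scoring rule in a standalone function (`score_moves` is a name introduced here for illustration) and applies it to made-up priors and rollout outcomes.

```python
def score_moves(priors, results, weight):
    # same scoring rule as select(): weighted average of the visit-scaled
    # prior and the mean rollout value of each candidate move
    scores = {}
    for move, outcomes in results.items():
        value = sum(outcomes) / len(outcomes) if outcomes else 0
        prior = priors[0][move - 1] / (1 + len(outcomes))
        scores[move] = weight * prior + (1 - weight) * value
    return scores

priors = [[0.3, 0.7]]            # made-up policy output for moves 1 and 2

# before any rollout, the scores are just the weighted priors
fresh = score_moves(priors, {1: [], 2: []}, weight=0.95)
first_pick = max(fresh, key=fresh.get)

# after three losing rollouts for move 2, its prior is divided by 4
later = score_moves(priors, {1: [], 2: [-1, -1, -1]}, weight=0.95)
next_pick = max(later, key=later.get)

print(first_pick, next_pick)     # 2 1
```

Move 2 is chosen first because of its larger prior. After three losing rollouts its score falls below that of move 1, so the search turns to the less-visited move.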

```python
# roll out a game till terminal state or depth reached
def simulate(env_copy,done,reward,depth):
    # if the game has already ended
    if done:
        return reward
    # start from the current state in case depth is zero
    state=env_copy.state
    for _ in range(depth):
        move=env_copy.sample()
        state,reward,done,info=env_copy.step(move)
        # if terminal state is reached, return the outcome
        if done:
            return reward
    # depth reached but game not over, evaluate
    onehot_state=onehot_encoder(state)
    # use the trained actor critic model to evaluate
    action_probs, critic_value = model(onehot_state)
    # output is predicted game outcome
    return critic_value[0,0]
```

## 4.2. Create An AlphaZero Agent in the Coin Game


```python
def alphazero_coin(env,weight,depth,num_rollouts=100):
    # if there is only one valid move left, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # get the prior from the AC policy network
    onehot_state = onehot_encoder(env.state)
    priors, _ = model(onehot_state)    
    # create a dictionary results
    results={}
    for move in env.validinputs:
        results[move]=[]
    # roll out games
    for _ in range(num_rollouts):
        # select
        move=select(priors,env,results,weight)
        # expand
        env_copy, done, reward=expand(env,move)
        # simulate
        reward=simulate(env_copy,done,reward,depth)
        # backpropagate
        results=backpropagate(env,move,reward,results)
    # select the most visited child node
    visits={k:len(v) for k,v in results.items()}
    return max(visits,key=visits.get)
```
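
`alphazero_coin()` calls `expand()` and `backpropagate()`, which are not listed in this section. The sketch below shows one plausible shape for them, exercised on a stand-in environment so that it runs on its own. Everything in it is an assumption rather than the module's actual code: the `ToyCoinEnv` class, its `turn` attribute, and the convention that a reward of 1 means Player 1 won and -1 means Player 2 won.

```python
from copy import deepcopy

class ToyCoinEnv:
    # hypothetical stand-in for coin_game: 21 coins, take 1 or 2, last coin wins
    def __init__(self):
        self.state = 21
        self.turn = 1
        self.validinputs = [1, 2]

    def step(self, move):
        self.state -= move
        done = self.state == 0
        # assumed convention: +1 if Player 1 took the last coin, -1 if Player 2 did
        reward = 0
        if done:
            reward = 1 if self.turn == 1 else -1
        self.turn = 2 if self.turn == 1 else 1
        self.validinputs = [m for m in (1, 2) if m <= self.state]
        return self.state, reward, done, {}

def expand(env, move):
    # try the move on a copy so the real game is left untouched
    env_copy = deepcopy(env)
    state, reward, done, info = env_copy.step(move)
    return env_copy, done, reward

def backpropagate(env, move, reward, results):
    # store the outcome from the point of view of the player to move
    if env.turn == 1:
        results[move].append(reward)
    else:
        results[move].append(-reward)
    return results

env = ToyCoinEnv()
results = {move: [] for move in env.validinputs}
env_copy, done, reward = expand(env, 2)
results = backpropagate(env, 2, reward, results)
print(env.state, env_copy.state, done, results)
```

Working on a deep copy lets each rollout explore freely while the real game stays at its current position.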

## 4.3. AlphaZero vs AlphaGo in the Coin Game


In [11]:
from utils.coin_simple_env import coin_game
from utils.ch16util import alphago_coin
from utils.ch19util import alphazero_coin
# initiate game environment
env=coin_game()
# test ten games
for i in range(10):
    state=env.reset()  
    while True: 
        # The AlphaGo agent moves first
        action=alphago_coin(env,0.95,25,num_rollouts=1000)  
        state,reward,done,_=env.step(action)
        if done:
            # print out the winner
            print("AlphaGo wins!")
            break
        # The AlphaZero agent moves second
        action=alphazero_coin(env,0.95,25,num_rollouts=1000) 
        state,reward,done,_=env.step(action)     
        if done:
            # print out the winner
            print("AlphaZero wins!")
            break        

AlphaZero wins!
AlphaZero wins!
AlphaZero wins!
AlphaZero wins!
AlphaZero wins!
AlphaZero wins!
AlphaZero wins!
AlphaZero wins!
AlphaZero wins!
AlphaZero wins!
