# Chapter 21: AlphaZero in Unsolved Games



***
***“Our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.”***

-- David Silver et al., Mastering the game of Go without human knowledge (Nature, 2017)
***



What you'll learn in this chapter:

* Implementing AlphaZero in Connect Four 
* Iteratively training AlphaZero in Connect Four through self-play
* Testing AlphaZero against earlier versions of itself
* Pitting AlphaZero against AlphaGo

In the previous two chapters, we implemented the AlphaZero algorithm in the coin game and Tic Tac Toe. The agent was trained purely through deep reinforcement learning, without human expert data, guidance, or domain knowledge beyond the game rules. We did, however, use rule-based AI to evaluate the agent periodically and decide when to stop training. The rule-based AI played no part in the training itself; it only monitored the agent’s performance so that we could avoid unnecessary training.

However, in unsolved games such as Chess or Go, assessing the performance of AlphaZero is more difficult since we don’t have an algorithm that’s more powerful than AlphaZero to use as the benchmark. How should we test the performance of AlphaZero and decide when to stop training in such cases? That’s the question we are going to answer in this chapter.

Even though Connect Four is a solved game, implementing a game-solving rule-based algorithm involves too many steps. Therefore, we treat Connect Four as an unsolved game. In this chapter, you’ll train an AlphaZero agent from scratch in Connect Four. To test the performance of AlphaZero and decide when to stop training, we use an earlier version of AlphaZero as the benchmark. When AlphaZero outperforms the earlier version of itself by a certain margin, we end that training iteration. We then copy the newly trained parameters into the older version and restart training against this stronger benchmark. We repeat the process for several iterations to strengthen the AlphaZero agent.
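
The stopping rule described above can be sketched in a few lines. This is a minimal illustration rather than the chapter's training code: `should_stop`, `margin`, and `min_episodes` are hypothetical names, and the default values (a mean reward of at least 0.1 over the most recent 1,000 games, checked only after more than 1,000 episodes) mirror the thresholds used in the training loop later in this chapter.

```python
from collections import deque

import numpy as np


def should_stop(running_rewards, episode_count, margin=0.1, min_episodes=1000):
    """Return True once the agent beats the older version by `margin`."""
    if episode_count <= min_episodes or len(running_rewards) == 0:
        return False
    return float(np.mean(list(running_rewards))) >= margin


# Rewards are +1 (win), -1 (loss), or 0 (tie) from the current agent's view.
recent = deque(maxlen=1000)
recent.extend([1] * 560 + [-1] * 440)            # 56% wins, 44% losses
print(should_stop(recent, episode_count=1200))   # mean reward is 0.12
print(should_stop(recent, episode_count=800))    # too early to judge
```

Because a win counts as +1 and a loss as -1, a mean reward of 0.1 means the agent wins roughly ten percentage points more games than it loses against the older version.
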

After a couple of days of training, we test the trained AlphaZero agent against the AlphaGo agent we developed in Chapter 18. Over 100 games, with the two agents alternating who moves first, the newly trained AlphaZero wins 79 and loses 21, outperforming its predecessor, AlphaGo!

# 1. Steps to Train AlphaZero in Connect Four

## 1.1. Steps to Train AlphaZero in Connect Four

## 1.2. A Policy Gradient Network in Connect Four

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_inputs=42
num_actions=7
# The convolutional input layer
conv_inputs=layers.Input(shape=(7,6,1))
conv=layers.Conv2D(filters=128, kernel_size=(4,4),padding="same",
     activation="relu")(conv_inputs)
flat=layers.Flatten()(conv)
# The dense input layer
inputs = layers.Input(shape=(42,))
# Combine the two branches with a Keras layer so the graph
# stays portable across Keras versions
two_inputs = layers.Concatenate(axis=1)([flat,inputs])
# hidden layers
common = layers.Dense(256, activation="relu")(two_inputs)
common = layers.Dense(64, activation="relu")(common)
# Policy output network
action = layers.Dense(num_actions, activation="softmax")(common)
# The final model
model = keras.Model(inputs=[inputs, conv_inputs],\
                    outputs=action) 
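
To get a feel for the size of this network, the parameter count can be worked out by hand. The sketch below is plain Python arithmetic and does not call Keras; it assumes that `padding="same"` keeps the 7×6 grid, so the flattened convolutional output has 7·6·128 values. The variable names are illustrative only.

```python
# Layer sizes taken from the model definition above.
conv_params = 4 * 4 * 1 * 128 + 128      # 4x4 kernel, 1 channel, 128 filters
flat_size = 7 * 6 * 128                  # "same" padding keeps the 7x6 grid
concat_size = flat_size + 42             # conv features plus the flat board
dense1_params = concat_size * 256 + 256  # first hidden layer
dense2_params = 256 * 64 + 64            # second hidden layer
policy_params = 64 * 7 + 7               # one output per column
total_params = (conv_params + dense1_params
                + dense2_params + policy_params)
print(total_params)
```

Under these assumptions the network has about 1.4 million parameters, nearly all of them in the first dense layer that follows the concatenation.
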

In [2]:
optimizer=keras.optimizers.Adam(learning_rate=0.00025,
                                clipnorm=1)   

# 2. Prepare to Train AlphaZero in Connect Four

## 2.1. Train the Red and Yellow Players

In [3]:
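# A frozen copy of the network to serve as the opponent. Cloning gives
# it its own layers; building a second keras.Model from the same tensors
# would share weights with the model being trained.
old_model = keras.models.clone_model(model)
old_model.set_weights(model.get_weights())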

from utils.ch20util import alphazero

# define an opponent for the policy gradient agent
weight=0.05
num_rollouts=50
def opponent(env):
    move=alphazero(env,weight,old_model,
                   num_rollouts=num_rollouts) 
    return move                        

In [4]:
from utils.conn_simple_env import conn
import numpy as np
# allow a maximum of 50 steps per game
max_steps=50
env=conn()

def playing_red():
    # create lists to record game history
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    episode_reward = 0
    state = env.reset()
    for step in range(max_steps):
        state = state.reshape(-1,42,)
        conv_state = state.reshape(-1,7,6,1)
        # Predict action probabilities with the policy network
        action_probs = model([state,conv_state])
        # select action based on the policy network
        action=np.random.choice(num_actions,\
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(\
                        tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
        # otherwise, place the move on the game board
        else:
            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(reward)
                episode_reward += reward
                break
            else:
                state,reward,done,_=env.step(opponent(env))
                rewards_history.append(reward)
                wrongmoves_history.append(0)
                episode_reward += reward                 
                if done:
                    break                
    return action_probs_history,\
        wrongmoves_history,rewards_history,episode_reward

In [5]:
gamma=0.95
def discount_rs(r,wrong):
    discounted_rs = np.zeros(len(r))
    running_add = 0
    for i in reversed(range(0, len(r))):
        if wrong[i]==0:  
            running_add = gamma*running_add + r[i]
            discounted_rs[i] = running_add  
    return discounted_rs.tolist() 
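
A small worked example may make the discounting clearer. The game below is hypothetical: four recorded steps, the second of which was an illegal move, with a win on the last step. The function body repeats the logic of `discount_rs` so that the snippet runs on its own, and the last two lines combine the discounted rewards with the illegal-move penalties in the same way the batch-building functions in this chapter do.

```python
import numpy as np

gamma = 0.95


def discount_rs(r, wrong):
    # Same logic as the cell above: steps flagged as illegal moves are
    # skipped, so they neither receive nor pass on discounted reward.
    discounted_rs = np.zeros(len(r))
    running_add = 0
    for i in reversed(range(len(r))):
        if wrong[i] == 0:
            running_add = gamma * running_add + r[i]
            discounted_rs[i] = running_add
    return discounted_rs.tolist()


# Hypothetical game: the agent wins on the fourth recorded step.
rewards = [0, 0, 0, 1]
wrong_moves = [0, -1, 0, 0]
returns = discount_rs(rewards, wrong_moves)
combined = (np.array(returns) + np.array(wrong_moves)).tolist()
print(returns)
print(combined)
```

The legal moves receive returns of roughly 0.9025, 0.95, and 1, while the illegal move gets no share of the win and ends up with the undiscounted penalty of -1.
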

In [6]:
def playing_yellow():
    # create lists to record game history
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    state, reward, done, _ = env.step(opponent(env))
    for step in range(max_steps):
        state = state.reshape(-1,42,)
        conv_state = state.reshape(-1,7,6,1)
        # Predict action probabilities with the policy network
        action_probs = model([-state,-conv_state])
        # select action based on the policy network
        action=np.random.choice(num_actions,\
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(\
                        tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
        # otherwise, place the move on the game board
        else:
            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(-reward)
                episode_reward += -reward 
                break
            else:
                state,reward,done,_=env.step(opponent(env))
                rewards_history.append(-reward)
                wrongmoves_history.append(0)
                episode_reward += -reward                 
                if done:
                    break                
    return action_probs_history,\
        wrongmoves_history,rewards_history,episode_reward
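
The only differences between `playing_yellow()` and `playing_red()` are that the opponent moves first and that the state and rewards are negated. The sketch below illustrates why negating the state lets a single network play both colors. The board encoding is an assumption, not something confirmed by the environment's code: 1 for a red piece, -1 for a yellow piece, and 0 for an empty cell. The variable names are hypothetical.

```python
import numpy as np

# Assumed encoding: 1 = red piece, -1 = yellow piece, 0 = empty cell.
board = np.zeros((7, 6))
board[3, 0] = 1      # red has played in the middle column
board[2, 0] = -1     # yellow has replied in the column next to it

state = board.reshape(-1, 42)            # flat input for the dense branch
conv_state = board.reshape(-1, 7, 6, 1)  # image-like input for the conv branch

# From yellow's point of view, negate so that its own pieces become +1.
yellow_state, yellow_conv_state = -state, -conv_state
print(state.shape, conv_state.shape)
print(int(yellow_conv_state[0, 2, 0, 0]))
```

With this convention the network always sees its own pieces as +1 and the opponent's as -1, regardless of which color it is playing.
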

## 2.2. Update Parameters in Training

In [7]:
batch_size=20     
def create_batch_red(batch_size):
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,wms,rs,er= playing_red()
        # rewards are discounted
        returns = discount_rs(rs,wms)
        action_probs_history += aps
        # punishments for wrong moves are not discounted
        wrongmoves_history += wms
        # combined discounted rewards with punishments
        combined=np.array(returns)+np.array(wms)
        # add combined rewards to rewards history
        rewards_history += combined.tolist()
        episode_rewards.append(er)  
    return action_probs_history,\
        rewards_history,episode_rewards  

In [8]:
def create_batch_yellow(batch_size):
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,wms,rs,er= playing_yellow()
        # reward related to legal moves are discounted
        returns = discount_rs(rs,wms)
        action_probs_history += aps
        # punishments for wrong moves are not discounted
        wrongmoves_history += wms
        # combined discounted rewards with punishments
        combined=np.array(returns)+np.array(wms)
        # add combined rewards to rewards history
        rewards_history += combined.tolist()
        episode_rewards.append(er) 
    return action_probs_history,\
        rewards_history,episode_rewards                

# 3. Training AlphaZero in Connect Four

## 3.1. The Training Loop in the First Iteration

In [9]:
from collections import deque
import numpy as np

running_rewards=deque(maxlen=1000)
episode_count = 0
batches=0
# Train the model
while True:
    with tf.GradientTape() as tape:
        if batches%2==0:
            action_probs_history,rewards_history,\
                episode_rewards=create_batch_red(batch_size)
        else:
            action_probs_history,rewards_history,\
                episode_rewards=create_batch_yellow(batch_size)             
        # Calculating loss values to update our network        
        rets=tf.convert_to_tensor(rewards_history,\
                                  dtype=tf.float32)
        alosses=-tf.multiply(tf.convert_to_tensor(\
          action_probs_history,dtype=tf.float32),rets)
        # Backpropagation
        loss_value = tf.reduce_sum(alosses) 
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads,\
                                  model.trainable_variables))
    # Log details
    episode_count += batch_size
    batches += 1
    running_rewards+=episode_rewards
    running_reward=np.mean(np.array(running_rewards)) 
    # print out progress
    if episode_count % 100 == 0:
        template = "running reward: {:.6f} at episode {}"
        print(template.format(running_reward, episode_count))   
    # Stop once the agent beats the older version by the target margin
    if running_reward>=0.1 and episode_count>1000:  
        print("Solved at episode {}!".format(episode_count))
        break
    if episode_count>25000:  
        break    
model.save("files/CONNzero0.h5")                  
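
The loss in the loop above has the standard policy-gradient form: minus the sum, over all recorded moves in the batch, of the log-probability of the chosen action times its (discounted and penalized) return. The NumPy sketch below evaluates that expression on a hypothetical two-move batch. It only illustrates the arithmetic; it does not compute gradients, and the numbers are made up.

```python
import numpy as np

# Hypothetical batch of two moves: one sampled with probability 0.5 that
# was later rewarded (+1), one sampled with probability 0.25 that was
# punished (-1).
log_probs = np.log(np.array([0.5, 0.25]))
returns = np.array([1.0, -1.0])

# Same form as the loss in the training loop:
# minus the sum of log-probability times return.
loss = -np.sum(log_probs * returns)
print(loss)
```

Minimizing this quantity raises the probability of moves with positive returns and lowers the probability of moves with negative returns.
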

## 3.2. Train for More Iterations


In [10]:
from collections import deque

for i in range(4):
    weight=0.1+(i+1)*0.2
    num_rollouts=50+(i+1)*50
    reload=keras.models.load_model(f"files/CONNzero{i}.h5")
    model.set_weights(reload.get_weights()) 
    # update weights in the opponent
    old_model.set_weights(reload.get_weights()) 
    running_rewards=deque(maxlen=1000)
    episode_count = 0
    batches=0
    # Train the model
    while True:
        with tf.GradientTape() as tape:
            if batches%2==0:
                action_probs_history,rewards_history,\
                    episode_rewards=create_batch_red(batch_size)
            else:
                action_probs_history,rewards_history,\
                    episode_rewards=create_batch_yellow(batch_size)             
            # Calculating loss values to update our network        
            rets=tf.convert_to_tensor(rewards_history,\
                                      dtype=tf.float32)
            alosses=-tf.multiply(tf.convert_to_tensor(\
              action_probs_history,dtype=tf.float32),rets)
            # Backpropagation
            loss_value = tf.reduce_sum(alosses) 
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads,\
                                      model.trainable_variables))
        # Log details
        episode_count += batch_size
        batches += 1
        running_rewards+=episode_rewards
        running_reward=np.mean(np.array(running_rewards)) 
        # print out progress
        if episode_count % 100 == 0:
            template = "running reward: {:.6f} at episode {}"
            print(template.format(running_reward, episode_count))   
        # Stop once the agent beats the older version by the target margin
        if running_reward>=0.1 and episode_count>1000:  
            print("Solved at episode {}!".format(episode_count))
            break
        if episode_count>25000:  
            break    
    model.save(f"files/CONNzero{i+1}.h5")             
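
To see how the opponent changes from one iteration to the next, the short sketch below reproduces the `weight` and `num_rollouts` values that the loop above passes to `alphazero`, together with the files loaded and saved. It only replays the arithmetic and trains nothing; the `schedule` list is a hypothetical helper.

```python
# Reproduce the per-iteration settings used by the loop above.
schedule = []
for i in range(4):
    weight = round(0.1 + (i + 1) * 0.2, 2)   # rounded for display only
    num_rollouts = 50 + (i + 1) * 50
    schedule.append(
        (f"CONNzero{i}.h5", weight, num_rollouts, f"CONNzero{i+1}.h5"))

for start, weight, rollouts, saved in schedule:
    print(f"start from {start}: weight={weight}, "
          f"rollouts={rollouts}, save {saved}")
```

The first iteration used `weight=0.05` and 50 rollouts, so each later iteration faces an opponent that both starts from stronger parameters and searches with more rollouts.
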

## 3.3. Tips for Further Training

# 4. Test AlphaZero in Connect Four

In [11]:
import numpy as np
from tensorflow import keras
from utils.ch17util import alphago

weight=0.75
num_rollouts=584

# Load the trained fast policy network from Chapter 11
fast_net=keras.models.load_model("files/policy_conn.h5")
# Load the policy gradient network from Chapter 15
PG_net=keras.models.load_model("files/PG_conn.h5")
# Load the trained value network from Chapter 15
value_net=keras.models.load_model("files/value_conn.h5")

# Define alphago_move
def alphago_move(env):
    move=alphago(env,weight,45,PG_net,value_net,
        fast_net, policy_rollout=False,
                 num_rollouts=num_rollouts)
    return move



In [12]:
# Define alphazero_move
from utils.ch20util import alphazero

old_model=keras.models.load_model("files/CONNzero4.h5")
def alphazero_move(env):
    move=alphazero(env,weight,old_model,
                   num_rollouts=num_rollouts) 
    return move  



In [13]:
from utils.conn_simple_env import conn
env=conn()
results=[]
for i in range(100):
    state=env.reset()  
    if i%2==0:
        action=alphago_move(env)  
        state,reward,done,_=env.step(action)        
    while True: 
        # AlphaZero moves
        action=alphazero_move(env) 
        state,reward,done,_=env.step(action)     
        if done:
            results.append(abs(reward))
            break      
        # AlphaGo moves
        action=alphago_move(env)  
        state,reward,done,_=env.step(action)
        if done:
            results.append(-abs(reward))
            break        

In [14]:
# Print out the number of games that AlphaZero won
wins=results.count(1)
print(f"The AlphaZero agent has won {wins} games.")
# Print out the number of games that AlphaZero lost
losses=results.count(-1)
print(f"The AlphaZero agent has lost {losses} games.")
# Print out the number of tie games
ties=results.count(0)
print(f"The game was tied {ties} times.")

The AlphaZero agent has won 79 games.
The AlphaZero agent has lost 21 games.
The game was tied 0 times.
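
One way to judge whether a 79-to-21 result could be down to luck is to put a rough confidence interval around the win rate. The sketch below uses the normal approximation for a binomial proportion. This is an illustrative check that is not part of the original experiment, and it assumes the 100 games are independent.

```python
import math

wins, losses, ties = 79, 21, 0          # results reported above
games = wins + losses + ties
win_rate = wins / games

# Normal-approximation 95% interval for a binomial proportion.
std_err = math.sqrt(win_rate * (1 - win_rate) / games)
low = win_rate - 1.96 * std_err
high = win_rate + 1.96 * std_err
print(f"win rate {win_rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

The interval runs from about 0.71 to 0.87 and lies well above 0.5, which is consistent with AlphaZero being the stronger of the two agents under these settings.
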
