# Chapter 20: Iterative Self-Play and AlphaZero in Tic Tac Toe



***
***“This neural network improves the strength of the tree search, resulting in higher quality
move selection and stronger self-play in the next iteration.”***

-- David Silver et al., Mastering the game of Go without human knowledge (Nature, 2017)
***



What you'll learn in this chapter:

* Creating an AlphaZero algorithm that can be applied to Tic Tac Toe and Connect Four
* Implementing AlphaZero in Tic Tac Toe by combining MCTS with a policy gradient network
* Training the AlphaZero agent through iterative self-play
* Testing the trained AlphaZero agent against the MiniMax agent

In the previous chapter, you developed a streamlined version of AlphaZero for the coin game. AlphaZero learns a game strategy exclusively through self-play and deep reinforcement learning, without guidance from human experts or any knowledge beyond the rules of the game. A distinguishing feature of AlphaZero, setting it apart from its predecessor AlphaGo, is its use of the actor-critic technique. This method employs two distinct networks for output: the policy network, which predicts the most promising next moves, and the value network, which estimates the likely winner of the game. You learned to train an actor-critic agent for the coin game and to integrate it with Monte Carlo tree search (MCTS) to form an AlphaZero agent. That agent solves the coin game and plays it perfectly.

In this chapter, we will guide you through constructing an AlphaZero agent for Tic Tac Toe. In simple games like Tic Tac Toe or Connect Four, however, the value network contributes little to improving game strategies. Although a value network eliminates the need to play out simulated games all the way to a terminal state, evaluating game states with a neural network is itself time-consuming. Whether a value network improves game strategy is therefore an empirical question. In Chapter 18, we demonstrated that forgoing the value network in favor of rolling out games all the way to their terminal states is the most effective strategy.

Given these insights, we will not develop actor-critic networks for Tic Tac Toe or Connect Four. Instead, we will use the policy-gradient method for these games, which is advantageous because it involves only a single output network, making the training process more efficient.

More precisely, in this chapter, you will learn to construct an AlphaZero agent for Tic Tac Toe by first developing a policy gradient agent. We will then integrate this policy gradient agent with Monte Carlo tree search (MCTS) to develop an AlphaZero agent. This agent selects child nodes based on recommendations from the policy gradient network in addition to the rollout value, blending guided exploration with evaluation.

In this chapter, we develop an AlphaZero agent that is multifunctional, capable of managing both Tic Tac Toe and Connect Four. This adaptability minimizes code duplication and streamlines the process of applying the AlphaZero algorithm across a wider array of games. As we proceed to the next chapter to create an AlphaZero agent for Connect Four, our attention can be dedicated to developing the most effective policy gradient network without the concern of integrating it with MCTS to formulate an AlphaZero algorithm.


Now back to training the AlphaZero agent in Tic Tac Toe. In the beginning, the policy gradient network is untrained and initialized with random weights. During the training phase, the policy gradient agent competes against a more advanced version of itself: the AlphaZero agent. As training progresses, both agents gradually improve their performance.

This training approach differs from the one described in Chapter 14, where the policy gradient agent faced opponents with static strategies. In the current training regimen, the policy gradient agent competes against its evolving self. This dynamic scenario presents a unique challenge, as the agent effectively faces a "moving target," complicating the training process. The opponent, in this case the AlphaZero agent, employs the policy gradient network to choose child nodes during the game. Given that this network is simultaneously being trained, improvements in the agent's capabilities directly enhance the opponent's strength, as it relies on the same policy network for making decisions in Monte Carlo Tree Search (MCTS) simulations.

To address the challenge of this moving target, an iterative self-play methodology is employed. Initially, we keep the weights of the policy gradient network, as utilized by the AlphaZero agent, constant, while updating the weights within the policy gradient network itself. Once the policy gradient agent achieves a specified performance level, we conclude the first iteration and proceed to update the weights in the policy gradient network used by the AlphaZero agent. This process is repeated in successive iterations until the AlphaZero agent perfects its gameplay and consistently solves the Tic Tac Toe game.
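
The iteration schedule described above can be sketched in a few lines. This is a minimal illustration, not the chapter's training code: `iterative_self_play`, `train_step`, `evaluate`, and the toy stand-ins are hypothetical names, and a single number stands in for the network weights.

```python
import copy

def iterative_self_play(weights, train_step, evaluate, thresholds,
                        max_steps=1000):
    """Train against a frozen copy of the learner, one iteration per threshold."""
    history = []
    for iteration, threshold in enumerate(thresholds):
        # Freeze the opponent: it keeps these weights for the whole iteration
        frozen = copy.deepcopy(weights)
        steps = 0
        while steps < max_steps:
            # Only the learner's weights change inside an iteration
            weights = train_step(weights, frozen)
            steps += 1
            if evaluate(weights) >= threshold:
                break
        history.append((iteration, steps, evaluate(weights)))
    return weights, history

# Toy stand-ins (assumptions for illustration only)
opponents_seen = []

def toy_train_step(w, frozen):
    opponents_seen.append(frozen)   # record which opponent we trained against
    return w + 1                    # pretend each update adds one unit of skill

def toy_evaluate(w):
    return w

final, history = iterative_self_play(0, toy_train_step, toy_evaluate,
                                     thresholds=[3, 5, 8])
print(history)         # [(0, 3, 3), (1, 2, 5), (2, 3, 8)]
print(opponents_seen)  # [0, 0, 0, 3, 3, 5, 5, 5]
```

The second printout shows the point of the scheme: within an iteration the learner always faces the same opponent, and the opponent only catches up when the iteration ends.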

# 1. An AlphaZero Agent for Multiple Games

## 1.1. Select, expand, roll out, and backpropagate

```python
def select(priors,env,results,weight):
    # score each move by a weighted average of prior and rollout value
    scores={}
    for k,v in results.items():
        # average rollout value for this move (0 if never visited)
        if len(v)==0:
            vi=0
        else:
            vi=sum(v)/len(v)
        # shrink the prior as the visit count grows: prior/(1+N)
        prior=priors[0][k-1]/(1+len(v))
        # calculate weighted average
        scores[k]=weight*prior+(1-weight)*vi
    # select the child node with the highest score
    return max(scores,key=scores.get)
```
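
A small worked example may help show how the prior and the rollout value interact in `select`. The sketch below restates the same scoring rule and applies it to made-up priors and outcomes (the numbers are assumptions chosen for illustration, and `None` is passed for the unused `env` argument).

```python
def select(priors, env, results, weight):
    # same scoring rule as the select() function above
    scores = {}
    for k, v in results.items():
        vi = sum(v) / len(v) if v else 0
        prior = priors[0][k-1] / (1 + len(v))
        scores[k] = weight * prior + (1 - weight) * vi
    return max(scores, key=scores.get)

priors = [[0.5, 0.3, 0.2]]        # made-up priors for moves 1, 2, 3
results = {1: [], 2: [], 3: []}   # no rollouts yet

first = select(priors, None, results, 0.5)
results[first].append(-1)         # pretend the rollout from move 1 was lost
second = select(priors, None, results, 0.5)
results[second].append(1)         # pretend the rollout from move 2 was won
third = select(priors, None, results, 0.5)
print(first, second, third)       # 1 2 2
```

With no rollouts recorded, the highest prior wins. After one lost rollout, move 1 scores 0.5*0.25 + 0.5*(-1) = -0.375 and falls behind move 2, and a won rollout then keeps move 2 on top at 0.5*0.15 + 0.5*1 = 0.575.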

```python
from copy import deepcopy

# expand the game tree by taking a hypothetical move
def expand(env,move):
    # work on a copy so the real game is left untouched
    env_copy=deepcopy(env)
    state,reward,done,info=env_copy.step(move)
    return env_copy, done, reward
```

```python
# roll out a game until it reaches a terminal state
def simulate(env_copy,done,reward):
    # if the game has already ended
    if done:
        return reward
    # otherwise, keep playing moves drawn from env_copy.sample()
    while True:
        move=env_copy.sample()
        state,reward,done,info=env_copy.step(move)
        # once the terminal state is reached, return the outcome
        if done:
            return reward
```

```python
def backpropagate(env,move,reward,results):
    # record the outcome from the current player's point of view
    if env.turn=="X" or env.turn=="red":
        results[move].append(reward)
    # if current player is player 2,
    # multiply outcome with -1
    elif env.turn=="O" or env.turn=="yellow":
        results[move].append(-reward)
    return results
```
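
To see the sign flip in `backpropagate` at work, the sketch below feeds the same outcome to both players. It assumes, as the sign handling suggests, that the environment reports a reward of -1 when the second player wins; the `SimpleNamespace` objects are hypothetical stand-ins for a real game environment.

```python
from types import SimpleNamespace

def backpropagate(env, move, reward, results):
    # same logic as the backpropagate() function above
    if env.turn == "X" or env.turn == "red":
        results[move].append(reward)
    elif env.turn == "O" or env.turn == "yellow":
        results[move].append(-reward)
    return results

# stand-in environments that only carry the `turn` attribute
env_x = SimpleNamespace(turn="X")
env_o = SimpleNamespace(turn="O")

# assumed convention: reward is -1 when the second player (O) wins
results_x = backpropagate(env_x, 5, -1, {5: []})
results_o = backpropagate(env_o, 5, -1, {5: []})
print(results_x, results_o)   # {5: [-1]} {5: [1]}
```

The same game outcome is stored as a loss when X is choosing a move and as a win when O is, so `select` can always maximize.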

## 1.2. AlphaZero for Tic Tac Toe and Connect Four

```python
def alphazero(env,weight,PG_net,num_rollouts=100):
    # if there is only one valid move left, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # get the prior from the PG policy network
    if env.turn=="X" or env.turn=="O":
        state = env.state.reshape(-1,9)
        conv_state = state.reshape(-1,3,3,1)
        if env.turn=="X":
            priors = PG_net([state,conv_state])
        elif env.turn=="O":
            priors = PG_net([-state,-conv_state])  
    if env.turn=="red" or env.turn=="yellow":
        state = env.state.reshape(-1,42)
        conv_state = state.reshape(-1,7,6,1)
        if env.turn=="red":
            priors = PG_net([state,conv_state])
        elif env.turn=="yellow":
            priors = PG_net([-state,-conv_state])          
    # create a dictionary results
    results={}
    for move in env.validinputs:
        results[move]=[]
    # roll out games
    for _ in range(num_rollouts):
        # select
        move=select(priors,env,results,weight)
        # expand
        env_copy, done, reward=expand(env,move)
        # simulate
        reward=simulate(env_copy,done,reward)
        # backpropagate
        results=backpropagate(env,move,reward,results)
    # select the most visited child node
    visits={k:len(v) for k,v in results.items()}
    return max(visits,key=visits.get)
```

# 2. A Blueprint to Train AlphaZero in Tic Tac Toe

## 2.1. Steps to Train AlphaZero in Tic Tac Toe


## 2.2. A Policy Gradient Network in Tic Tac Toe


```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_inputs = 9
num_actions = 9
# The convolutional input: the board as a 3x3 image with one channel
conv_inputs = layers.Input(shape=(3,3,1))
conv=layers.Conv2D(filters=64, kernel_size=(3,3),padding="same",
     activation="relu")(conv_inputs)
# Flatten the output from the conv layer
flat = layers.Flatten()(conv)
# The dense input: the board as a flat vector of nine cells
inputs = layers.Input(shape=(num_inputs,))
# Concatenate the two branches into a single feature vector
two_inputs = layers.Concatenate(axis=1)([flat,inputs])
# two hidden layers
common = layers.Dense(128, activation="relu")(two_inputs)
action = layers.Dense(32, activation="relu")(common)
# Policy output: one probability for each of the nine cells
action = layers.Dense(num_actions, activation="softmax")(action)
# The final model
model = keras.Model(inputs=[inputs, conv_inputs],
                    outputs=action)
```

```python
optimizer=keras.optimizers.Adam(learning_rate=0.00025,
                                clipnorm=1)
```

# 3. Train AlphaZero in Tic Tac Toe


## 3.1. Train Players X and O


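```python
from utils.ch20util import alphazero

# A frozen copy of the policy gradient network for the opponent.
# clone_model builds new layers, so training `model` leaves it unchanged;
# wrapping the same input/output tensors again would share the weights.
old_model = keras.models.clone_model(model)
old_model.set_weights(model.get_weights())

# define an opponent for the policy gradient agent
weight=0.5
num_rollouts=100
def opponent(env):
    move=alphazero(env,weight,old_model,
                   num_rollouts=num_rollouts)
    return move
```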

```python
from utils.ttt_simple_env import ttt
import numpy as np

# allow a maximum of 50 steps per game
max_steps=50
env=ttt()

def playing_X():
    # create lists to record game history
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    for step in range(max_steps):
        state = state.reshape(-1,9)
        conv_state = state.reshape(-1,3,3,1)
        # predict action probabilities
        action_probs = model([state,conv_state])
        # sample an action from the policy network
        action=np.random.choice(num_actions,
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(
                        tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
        # otherwise, place the move on the game board
        else:
            # apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(reward)
                break
            else:
                # the AlphaZero opponent replies
                state,reward,done,_=env.step(opponent(env))
                rewards_history.append(reward)
                wrongmoves_history.append(0)
                if done:
                    break
    return action_probs_history,\
            wrongmoves_history,rewards_history
```

```python
gamma=0.95
def discount_rs(r,wrong):
    discounted_rs = np.zeros(len(r))
    running_add = 0
    # walk backward through the game, skipping illegal moves
    for i in reversed(range(0, len(r))):
        if wrong[i]==0:
            running_add = gamma*running_add + r[i]
            discounted_rs[i] = running_add
    return discounted_rs.tolist()
```
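
A short worked example shows what `discount_rs` returns and how the batch functions in Section 3.2 later combine it with the penalties for illegal moves. The game history below is made up for illustration: four sampled actions, the second of which was illegal, ending in a win. `gamma` is passed as a default argument here only to keep the sketch self-contained.

```python
import numpy as np

def discount_rs(r, wrong, gamma=0.95):
    # same logic as discount_rs() above
    discounted_rs = np.zeros(len(r))
    running_add = 0
    for i in reversed(range(len(r))):
        if wrong[i] == 0:
            running_add = gamma * running_add + r[i]
            discounted_rs[i] = running_add
    return discounted_rs.tolist()

rewards = [0, 0, 0, 1]    # made-up history: the game ends in a win
wrong = [0, -1, 0, 0]     # the second sampled action was illegal

returns = discount_rs(rewards, wrong)
# penalties are added undiscounted, as in create_batch_X and create_batch_O
combined = (np.array(returns) + np.array(wrong)).tolist()
print(np.round(returns, 4).tolist())    # [0.9025, 0.0, 0.95, 1.0]
print(np.round(combined, 4).tolist())   # [0.9025, -1.0, 0.95, 1.0]
```

The illegal move is skipped when discounting, so the first move is only two legal steps from the win (0.95 squared is 0.9025), while the illegal move itself receives the full penalty of -1.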

```python
def playing_O():
    # create lists to record game history
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    # the opponent moves first
    state, reward, done, _ = env.step(opponent(env))
    for step in range(max_steps):
        state = state.reshape(-1,9)
        conv_state = state.reshape(-1,3,3,1)
        # predict action probabilities; multiply the board by -1
        action_probs = model([-state,-conv_state])
        # sample an action from the policy network
        action=np.random.choice(num_actions,
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(
                        tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
        # otherwise, place the move on the game board
        else:
            # apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            if done:
                wrongmoves_history.append(0)
                # flip the sign so the reward is from O's point of view
                rewards_history.append(-reward)
                break
            else:
                # the AlphaZero opponent replies
                state,reward,done,_=env.step(opponent(env))
                rewards_history.append(-reward)
                wrongmoves_history.append(0)
                if done:
                    break
    return action_probs_history,\
            wrongmoves_history,rewards_history
```
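
The only difference in how the network is queried for Player O is the sign of the board. The sketch below illustrates why that works, assuming the environment encodes X as 1, O as -1, and empty cells as 0 (this encoding is an assumption; check `ttt_simple_env` for the actual convention).

```python
import numpy as np

# Assumed encoding: 1 marks X, -1 marks O, 0 marks an empty cell
board = np.array([1, 0, 0,
                  0, -1, 0,
                  0, 0, 1])

# the two views of the board that the network expects
state = board.reshape(-1, 9)
conv_state = state.reshape(-1, 3, 3, 1)

# negating the board makes O's pieces +1 and X's pieces -1,
# so one network can always play as "the player marked +1"
o_view = -state
o_conv_view = -conv_state

print(state.shape, conv_state.shape)   # (1, 9) (1, 3, 3, 1)
print(o_view[0].tolist())              # [-1, 0, 0, 0, 1, 0, 0, 0, -1]
```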

## 3.2. Update Parameters in the Policy Gradient Network


```python
batch_size=10
def create_batch_X(batch_size):
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    for i in range(batch_size):
        aps,wms,rs = playing_X()
        # rewards related to legal moves are discounted
        returns = discount_rs(rs,wms)
        action_probs_history += aps
        # punishments for wrong moves are not discounted
        wrongmoves_history += wms
        # combine discounted rewards with punishments
        combined=np.array(returns)+np.array(wms)
        # add combined rewards to rewards history
        rewards_history += combined.tolist()
    return action_probs_history,rewards_history
```

```python
def create_batch_O(batch_size):
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    for i in range(batch_size):
        aps,wms,rs = playing_O()
        # rewards related to legal moves are discounted
        returns = discount_rs(rs,wms)
        action_probs_history += aps
        # punishments for wrong moves are not discounted
        wrongmoves_history += wms
        # combine discounted rewards with punishments
        combined=np.array(returns)+np.array(wms)
        # add combined rewards to rewards history
        rewards_history += combined.tolist()
    return action_probs_history,rewards_history
```

## 3.3. The Training Loop


```python
from utils.ch06util import MiniMax_ab

def rule_based_AI(env):
    move = MiniMax_ab(env)
    return move

def test():
    # play 10 games against MiniMax; record -1 for a loss,
    # 0 for a tie, and 1 for a win
    results=[]
    for i in range(10):
        state=env.reset()
        while True:
            # the MiniMax agent moves first
            action = rule_based_AI(env)
            state, reward, done, info = env.step(action)
            if done:
                results.append(-abs(reward))
                break
            # AlphaZero moves
            action=alphazero(env,weight,model,
                   num_rollouts=num_rollouts)
            state, reward, done, info = env.step(action)
            if done:
                results.append(abs(reward))
                break
    return results
```

```python
from collections import deque

# outcomes of the 100 most recent test games
tests=deque(maxlen=100)
n = 0
batches=0
# Train the model
while True:
    with tf.GradientTape() as tape:
        # alternate between playing as X and playing as O
        if batches%2==0:
            action_probs_history,rewards_history=\
                create_batch_X(batch_size)
        else:
            action_probs_history,rewards_history=\
                create_batch_O(batch_size)
        rets=tf.convert_to_tensor(rewards_history,
                                  dtype=tf.float32)
        # policy gradient loss: -log(probability) * combined reward
        alosses=-tf.multiply(tf.convert_to_tensor(
          action_probs_history,dtype=tf.float32),rets)
        loss_value = tf.reduce_sum(alosses)
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads,
                                  model.trainable_variables))
    n += batch_size
    batches += 1
    # every 100 episodes, test the agent against MiniMax
    if n % 100 == 0:
        results=test()
        tests += results
        losses = tests.count(-1)
        print(f"at episode {n}, lost {losses} games")
        if (losses<=40 and n>=1000) or n>=10000:
            print(f"Finished at episode {n}!")
            break
model.save("files/zero_ttt0.h5")
```
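
The loss built inside the `tf.GradientTape` block above is the policy gradient (REINFORCE) objective summed over every sampled move in the batch. Written out, with $\pi_\theta$ for the policy network, $s_t$ for the board, $a_t$ for the sampled move, and $G_t$ for the combined reward returned by `create_batch_X` or `create_batch_O` (these symbols are notation introduced here for illustration):

$$
L(\theta) = -\sum_{t} \log \pi_\theta(a_t \mid s_t)\, G_t
$$

Minimizing $L$ raises the probability of moves with positive combined rewards and lowers the probability of illegal moves and of moves that led to losses.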

```python
from collections import deque

for i in range(4):
    # rely more on the prior and roll out more games each iteration
    weight=0.5+(i+1)*0.1
    num_rollouts=100+(i+1)*100
    reload=keras.models.load_model(f"files/zero_ttt{i}.h5")
    model.set_weights(reload.get_weights())
    # update weights in the opponent
    old_model.set_weights(reload.get_weights())
    tests=deque(maxlen=100)
    n = 0
    batches=0
    # Train the model
    while True:
        with tf.GradientTape() as tape:
            if batches%2==0:
                action_probs_history,rewards_history=\
                    create_batch_X(batch_size)
            else:
                action_probs_history,rewards_history=\
                    create_batch_O(batch_size)
            rets=tf.convert_to_tensor(rewards_history,
                                      dtype=tf.float32)
            alosses=-tf.multiply(tf.convert_to_tensor(
              action_probs_history,dtype=tf.float32),rets)
            loss_value = tf.reduce_sum(alosses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads,
                                      model.trainable_variables))
        n += batch_size
        batches += 1
        if n % 100 == 0:
            results=test()
            tests += results
            losses = tests.count(-1)
            print(f"at episode {n}, lost {losses} games")
            # the bar for finishing an iteration rises each time
            if (losses<=40-(i+1)*10 and n>=1000) or n>=10000:
                print(f"Finished at episode {n}!")
                break
    model.save(f"files/zero_ttt{i+1}.h5")
```
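
The settings change with every pass through the outer loop, so it can help to list them. The snippet below only reproduces the arithmetic from the loop above; the `schedule` list and the `max_losses` name are introduced here for illustration.

```python
# Reproduce the per-iteration settings used in the loop above
schedule = []
for i in range(4):
    weight = round(0.5 + (i + 1) * 0.1, 1)   # weight on the prior
    num_rollouts = 100 + (i + 1) * 100       # rollouts per move
    max_losses = 40 - (i + 1) * 10           # losses allowed in recent tests
    schedule.append((i + 1, weight, num_rollouts, max_losses))

for row in schedule:
    print(row)
# (1, 0.6, 200, 30)
# (2, 0.7, 300, 20)
# (3, 0.8, 400, 10)
# (4, 0.9, 500, 0)
```

Unless the 10,000-episode cap is reached first, the final iteration stops only when the agent has lost none of its recent test games. Its settings, a weight of 0.9 and 500 rollouts, are the same ones used when testing the agent in Section 4.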

# 4. Test AlphaZero in Tic Tac Toe

```python
# Use the trained PG model
reload=keras.models.load_model("files/zero_ttt4.h5")
model.set_weights(reload.get_weights())
env=ttt()
for i in range(10):
    state=env.reset()
    # in even-numbered games, the MiniMax agent moves first
    if i%2==0:
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
    while True:
        # AlphaZero moves
        action=alphazero(env,0.9,model,
                   num_rollouts=500)
        state,reward,done,_=env.step(action)
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("AlphaZero wins!")
            break
        # MiniMax agent moves
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("AlphaZero loses!")
            break
```

```
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
```
