# Chapter 20: Alpha Zero: The Coin Game Version

In 2017, the DeepMind team published an improved version of AlphaGo, which they dubbed AlphaGo Zero. In contrast to the previous version of AlphaGo that beat the World Go Champion Lee Sedol in 2016, AlphaGo Zero was trained solely through deep reinforcement learning. The algorithm didn't use any human input or domain knowledge other than the game rules. AlphaGo Zero was trained through self-play from scratch. This has huge implications for AI: for many of the world's most complicated problems, human input and domain knowledge are limited. Getting AI to provide powerful solutions in such scenarios is challenging. AlphaGo Zero's success, however, gives us hope. 

In the next three chapters, you'll learn how to play simple games such as the coin game without any training data or human input other than the game rules. We'll start from scratch and use only self-play to train the models. You'll see that intelligent game strategies can be created this way. 

In this chapter, you'll learn to train a perfect game strategy to play the coin game through self-play only. Specifically, we'll create an actor-critic (AC) deep reinforcement model as we did in previous chapters. However, we'll not use rule-based AI players to train the model. Instead, we'll let the policy network in the AC model guide the players to select moves. At the same time, we'll create an AC agent to act as the opponent in the training process, hence the term "self-play". After just a few thousand games, the AC model is able to guide players to make perfect moves. In particular, the trained model will win the game 100% of the time if it moves second, even against the perfect rule-based AI player that we built in Chapter 1. 

***
$\mathbf{\text{Create a subfolder for files in Chapter 21}}$<br>
***
We'll put all files in Chapter 21 in a subfolder /files/ch21. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch21", exist_ok=True)

# 1. An Overview of the Training Process
In this section, we'll first summarize the steps to train the Alpha Zero agent for the coin game from scratch without any data or human input. We'll then lay out the deep reinforcement learning model we use for this purpose.

## 1.1. Steps to Train An Alpha Zero Agent
We'll first create a deep reinforcement model for the coin game with the actor-critic method as we did in the last few chapters. As usual, the model has a policy network and a value network. 

Once we have a model, we'll train Player 1 and Player 2 in the coin game. In the first ten games, we'll let Player 1 select moves based on the recommendations of the policy network in the deep reinforcement model we just created. Player 1 plays against the model itself, hence the name self-play. From these ten games, we collect the natural logarithms of the predicted probabilities, the discounted rewards, and the expected game outcomes from the value network, and use them to update the model parameters (i.e., train the model). In the next ten games, we'll let Player 2 select moves based on recommendations from the same policy network, again playing against the model itself, and we collect the same information to update the model parameters. We'll alternate between training Player 1 and Player 2 after every ten games. 

To make sure we know when to stop training, we'll test the model after every 1000 episodes: if the trained model can play the game perfectly as the second player against the rule-based AI player, we'll stop training.  

## 1.2. A Deep Reinforcement Model for the Coin Game
We'll create a model with just one input layer with 22 values in it. The model has two output layers: a policy network (actor) to predict the next move and a value network (critic) to predict the expected game outcome. The model is the same as the model we used in Chapter 15.

Specifically, the model we use is below:

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_inputs = 22
num_actions = 2
# The input layer
inputs = layers.Input(shape=(num_inputs,))
# The common layer
common = layers.Dense(32, activation="relu")(inputs)
common = layers.Dense(32, activation="relu")(common)
# The policy network
action = layers.Dense(num_actions, activation="softmax")(common)
# The value network
critic = layers.Dense(1, activation="tanh")(common)
model = keras.Model(inputs=inputs, outputs=[action, critic])
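
As a rough sanity check on the size of this network, we can count its trainable parameters by hand. The sketch below assumes that each Dense layer has a weight matrix plus one bias per unit (the Keras default); the helper name `dense_params` is ours and is not part of Keras.

```python
# Hand count of trainable parameters, assuming each Dense layer
# has a weight matrix plus one bias per unit (the Keras default).
def dense_params(n_in, n_out):
    # hypothetical helper, not a Keras function
    return n_in * n_out + n_out

num_inputs, hidden, num_actions = 22, 32, 2
# two shared hidden layers
shared = dense_params(num_inputs, hidden) + dense_params(hidden, hidden)
# the two output heads
policy_head = dense_params(hidden, num_actions)
value_head = dense_params(hidden, 1)
total_params = shared + policy_head + value_head
print(shared, policy_head, value_head, total_params)
```

If the model above has been built, `model.count_params()` should report the same total.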

Below, we specify the optimizer and the loss function we use to train the model. 

In [3]:
optimizer = keras.optimizers.Adam(learning_rate=0.001)
loss_func = keras.losses.MeanAbsoluteError()

The optimizer is Adam with a learning rate of 0.001. The loss function is the mean absolute error, which punishes outliers less than alternatives such as the mean squared error. 
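
To make the comparison with the mean squared error concrete, here is a small pure-Python sketch. The targets and predictions are made up for illustration; the point is only to show how much of each loss comes from a single large error.

```python
# Compare how mean absolute error (MAE) and mean squared error (MSE)
# react to one large error. The numbers are made up for illustration.
targets = [1.0, 1.0, 1.0]
predictions = [0.9, 1.1, -1.0]   # the last prediction is an outlier

errors = [t - p for t, p in zip(targets, predictions)]
mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e * e for e in errors) / len(errors)

# share of each loss that is due to the outlier alone
outlier_share_mae = abs(errors[-1]) / sum(abs(e) for e in errors)
outlier_share_mse = errors[-1] ** 2 / sum(e * e for e in errors)
print(f"MAE = {mae:.3f}, outlier share = {outlier_share_mae:.1%}")
print(f"MSE = {mse:.3f}, outlier share = {outlier_share_mse:.1%}")
```

The outlier accounts for a larger share of the squared-error loss than of the absolute-error loss, so it pulls the gradient harder under mean squared error.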

# 2. Train Player 1
In this section, we'll discuss in detail how to train the model for Player 1. Specifically, we'll create an AC agent based on the deep reinforcement learning model we created in the last section. The AC agent will act as the opponent for both Player 1 and Player 2 in self-play. 

We then simulate ten games at a time. After ten games, we'll collect information such as the predicted probabilities, discounted rewards, and the expected game outcome to update the model parameters. 

## 2.1. Create An Opponent
We'll let Player 1 play against an AC agent based on the same deep reinforcement learning model. Later, we'll use the same AC agent as the opponent for Player 2 as well. 

For that purpose, we define a helper function onehot_encoder() and an ACplayer() function as follows:

In [4]:
import numpy as np

def onehot_encoder(state):
    onehot=np.zeros((1,22))
    onehot[0,state]=1
    return onehot

def ACplayer(env): 
    # estimate action probabilities and future rewards
    onehot_state = onehot_encoder(env.state)
    action_probs, _ = model(onehot_state)
    # sample an action from the policy network's probability distribution
    action=np.random.choice([1,2],p=np.squeeze(action_probs))   
    return action
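
The two ideas in this cell, one-hot encoding the state and sampling a move from the policy's probabilities, can be tried out without the model. The sketch below is a pure-Python stand-in: it assumes the state is an integer from 0 to 21 (which is what the 22-element input suggests), and the probabilities are made up. The name `onehot_list` is ours.

```python
import random

def onehot_list(state, size=22):
    # pure-Python stand-in for onehot_encoder(): zeros except at `state`
    vec = [0.0] * size
    vec[state] = 1.0
    return vec

encoded = onehot_list(21)
print(encoded.index(1.0), sum(encoded))   # position of the 1 and the total

# Sample a move from made-up policy probabilities for moves 1 and 2
random.seed(0)
probs = [0.3, 0.7]
samples = [random.choices([1, 2], weights=probs)[0] for _ in range(10000)]
share_of_2 = samples.count(2) / len(samples)
print(f"move 2 chosen in {share_of_2:.1%} of samples")
```

Because the move is sampled rather than taken as the most likely one, the agent still explores the less likely move during training.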

## 2.2. Simulate Games for Player 1
Next, we'll define a playing_1() function. The playing_1() function simulates a full game, with Player 1 selecting moves from the policy network in the model, as follows:

In [5]:
from utils.coin_simple_env import coin_game
env=coin_game()
def playing_1():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    state = env.reset()
    while True:
        # estimate action probabilities and future rewards
        onehot_state = onehot_encoder(env.state)
        action_probs, critic_value = model(onehot_state)
        # select action based on policy network
        action=np.random.choice([1,2],p=np.squeeze(action_probs))
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # record log probabilities
        action_probs_history.append(\
                    tf.math.log(action_probs[0, action-1]))      
        # Apply the sampled action in our environment
        state, reward, done, _ = env.step(action)
        if done:
            rewards_history.append(reward)
            break
        else:
            state, reward, done, _ = env.step(ACplayer(env))
            rewards_history.append(reward)               
            if done:
                break                
    return action_probs_history,critic_value_history,\
        rewards_history

The playing_1() function simulates a full game and records the intermediate steps made by the AC agent. The function returns three lists: 
* action_probs_history, with the natural logarithm of the probability that the policy network assigned to each action taken by the agent; 
* critic_value_history, with the estimated future rewards from the value network; 
* rewards_history, with the reward for each action taken by the agent in the game.

In reinforcement learning, actions affect not only current-period rewards, but also future rewards. This is the credit assignment problem, which is at the heart of every reinforcement learning algorithm. The solution is to give credit to previous moves by using discounted rewards. We therefore use discounted rewards to assign credit properly as follows:

In [6]:
def discount_rs(r):
    discounted_rs = np.zeros(len(r))
    running_add = 0
    for i in reversed(range(0, len(r))):
        running_add = gamma*running_add + r[i]
        discounted_rs[i] = running_add  
    return discounted_rs.tolist()
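
To see what the discounting does, consider a made-up game in which the agent makes three moves and wins on the last one. The sketch below repeats the same backward recursion in pure Python, with gamma passed in explicitly (0.95, the value used later in this chapter); the function name `discount_example` is ours.

```python
def discount_example(rewards, gamma=0.95):
    # same backward recursion as discount_rs(), with gamma passed explicitly
    discounted = [0.0] * len(rewards)
    running_add = 0.0
    for i in reversed(range(len(rewards))):
        running_add = gamma * running_add + rewards[i]
        discounted[i] = running_add
    return discounted

# a made-up three-move game that the agent wins on its last move
returns = discount_example([0, 0, 1])
print([round(x, 4) for x in returns])
```

The winning move gets the full credit of 1, the move before it gets 0.95, and the first move gets 0.95 squared, or 0.9025.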

## 2.3. Update Parameters
Instead of updating model parameters after one episode, we update after a certain number of episodes to make training more stable. Here we update the parameters every ten games, as follows:

In [7]:
batch_size=10
def create_batch(playing_func):
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    for i in range(batch_size):
        aps,cvs,rs = playing_func()
        # rewards are discounted
        returns = discount_rs(rs)
        action_probs_history += aps
        critic_value_history += cvs
        rewards_history += returns       
    return action_probs_history,critic_value_history,\
        rewards_history

In [8]:
gamma=0.95
def train_player(playing_func):
    # Train the model for one batch (ten games)
    with tf.GradientTape() as tape:
        action_probs_history,critic_value_history,\
            rewards_history=create_batch(playing_func)
        # Convert the batch histories to tensors
        log_probs=tf.convert_to_tensor(action_probs_history,
                                       dtype=tf.float32)
        values=tf.convert_to_tensor(critic_value_history,
                                    dtype=tf.float32)
        returns=tf.convert_to_tensor(rewards_history,
                                     dtype=tf.float32)
        # Actor loss: -log probability times (return - predicted value)
        alosses=-tf.multiply(log_probs,returns-values)
        # Critic loss: mean absolute error of the value predictions
        closs=loss_func(returns,values)
        # Total loss for the batch
        loss_value = tf.reduce_sum(alosses) + closs
    # Backpropagation
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads,
                                  model.trainable_variables))

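Note that the actor loss weights the log probability of each action taken by the difference between the discounted reward and the value network's prediction, so the parameters are adjusted by the product of the gradients and this difference. This is related to the solution to the reward maximization problem.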
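
The loss used in train_player() can be worked through by hand on a tiny batch. The sketch below uses made-up numbers for two moves and plain Python instead of TensorFlow; it only illustrates the arithmetic, not the gradient step.

```python
import math

# Made-up numbers for two moves, for illustration only
log_probs = [math.log(0.5), math.log(0.8)]   # log probability of each action taken
returns = [0.95, 1.0]                        # discounted rewards
values = [0.2, 0.6]                          # value network predictions

# difference between what happened and what the critic expected
advantages = [r - v for r, v in zip(returns, values)]
# actor loss: sum of -log_prob * (return - value)
actor_loss = sum(-lp * adv for lp, adv in zip(log_probs, advantages))
# critic loss: mean absolute error between returns and predictions
critic_loss = sum(abs(adv) for adv in advantages) / len(advantages)
total_loss = actor_loss + critic_loss
print(round(actor_loss, 4), round(critic_loss, 4), round(total_loss, 4))
```

A move that turned out better than the critic expected (a positive difference) contributes a loss that shrinks as its probability rises, so minimizing the loss makes that move more likely.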

# 3. Train Player 2
In the last section, you learned how to train the AC agent to move first in self-play. Next, we'll use the same actor-critic model to train the AC agent to move second, again playing against itself. 

Next we define the playing_2() function to simulate a full game, with the AC agent from Section 2.1 as the first player, and the policy network selecting (and recording) the moves of the second player:

In [9]:
def playing_2():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    rewards_history = []
    state = env.reset()
    state, reward, done, _ = env.step(ACplayer(env))
    while True:
        # estimate action probabilities and future rewards
        onehot_state = onehot_encoder(env.state)
        action_probs, critic_value = model(onehot_state)
        # select action based on policy network
        action=np.random.choice([1,2],p=np.squeeze(action_probs))
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # record log probabilities
        action_probs_history.append(\
                        tf.math.log(action_probs[0, action-1]))      
        # Apply the sampled action in our environment
        state, reward, done, _ = env.step(action)
        if done:
            rewards_history.append(-reward)
            break
        else:
            state, reward, done, _ = env.step(ACplayer(env))
            rewards_history.append(-reward)                
            if done:
                break                
    return action_probs_history,critic_value_history, \
            rewards_history 

We'll call the function playing_1() to train Player 1, and playing_2() to train Player 2.  
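
The only bookkeeping difference between the two functions is the sign of the recorded rewards. The sketch below illustrates this under an assumption that the code suggests but does not state: the environment reports rewards from Player 1's point of view (+1 if Player 1 wins, -1 if Player 2 wins, 0 while the game continues). The helper name `player2_view` is ours.

```python
# Assumption: the environment reports rewards from Player 1's point of view
# (+1 if Player 1 wins, -1 if Player 2 wins, 0 while the game continues).
def player2_view(env_rewards):
    # hypothetical helper mirroring the `-reward` lines in playing_2()
    return [-r for r in env_rewards]

env_rewards = [0, 0, -1]            # made-up game that Player 2 wins
p2_rewards = player2_view(env_rewards)
print(p2_rewards)
```

With the sign flipped, a win for Player 2 is recorded as a positive reward, so the same training code can be reused for both players.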

Now that we have a model that can be used to create game strategies for both players, we'll just need to train the model.

# 3. Train the Model for Both Players

To alternate between training Player 1 and training Player 2, we'll create a variable *batches*. It starts with a value of 0. After each batch of training, we add 1 to the value of *batches*. We'll train Player 1 when the value of *batches* is even and train Player 2 otherwise. 
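
The schedule can be sketched without any model. The snippet below follows the convention of the training loop in this chapter (the counter starts at 0, an even value selects playing_1, and a test runs every 1000 episodes); the function name `schedule` is ours.

```python
batch_size = 10

def schedule(num_batches):
    # returns (episodes played so far, function trained) for each batch
    plan = []
    episode_count = 0
    for batches in range(num_batches):
        trained = "playing_1" if batches % 2 == 0 else "playing_2"
        episode_count += batch_size
        plan.append((episode_count, trained))
    return plan

plan = schedule(200)
print(plan[:4])
# number of batches completed when a test would run (every 1000 episodes)
test_points = [i + 1 for i, (eps, _) in enumerate(plan) if eps % 1000 == 0]
print(test_points)
```

With ten games per batch, the test runs once every 100 batches, so each player has been trained on 500 games between two consecutive tests.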

## 3.1. Keep Track of Progress

Alpha Zero algorithms are extremely costly in terms of computational needs, so it takes a long time to train the model. To make things more complicated, we need to tune the hyperparameters (such as the learning rate, the loss function, and so on) to make the model work. It's therefore important to measure the progress of the model so that you know whether things are going in the right direction.

To do that, we'll test the model against the rule-based AI to determine the progress. Note that the rule-based AI is not involved in training the model directly. It just checks how good the model is periodically, without interfering with the model itself. 

In [10]:
import random

def rule_based_AI(state):
    if state%3 != 0:
        move = state%3
    else:
        move = random.choice([1,2])
    return move

def test():
    results=[]
    for i in range(100):
        env = coin_game()
        state=env.reset()     
        while True:    
            action = rule_based_AI(state)  
            state, reward, done, info = env.step(action)
            if done:
                # record -1 if rule-based AI player won
                results.append(-1)
                break
            # estimate action probabilities and future rewards
            onehot_state = onehot_encoder(state)
            action_probs, _ = model(onehot_state)
            # select action with the highest probability
            action=np.argmax(action_probs[0])+1
            state, reward, done, info = env.step(action)
            if done:
                # record 1 if AC agent won
                results.append(1)            
                break  
    return results.count(1)      
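
A pure-Python sketch can show why the rule-based AI is a demanding benchmark and why the agent is tested as the second player. It assumes the rules implied by the code in this chapter: the game starts with 21 coins, each player removes one or two coins per turn, and whoever takes the last coin wins. The names `mod3_move`, `random_move`, and `play` are ours.

```python
import random

def mod3_move(coins):
    # same rule as rule_based_AI(): leave a multiple of 3 when possible
    if coins % 3 != 0:
        return coins % 3
    return random.choice([1, 2])

def random_move(coins):
    # never take more coins than are left
    return random.choice([1, 2]) if coins > 1 else 1

def play(first, second, coins=21):
    # returns 1 if the first mover takes the last coin, 2 otherwise
    movers = [first, second]
    turn = 0
    while True:
        coins -= movers[turn](coins)
        if coins == 0:
            return turn + 1
        turn = 1 - turn

random.seed(1)
# rule-following second player against a random first player
second_wins = sum(play(random_move, mod3_move) == 2 for _ in range(1000))
# rule-following first player against a rule-following second player
first_wins = sum(play(mod3_move, mod3_move) == 1 for _ in range(1000))
print(second_wins, first_wins)
```

Under these assumptions, 21 is a multiple of 3, so a second player who always leaves a multiple of 3 cannot lose, and even a rule-following first player never wins against it. This is why a perfect score is possible only when the agent moves second, and why any single mistake against the rule-based AI costs the game.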

## 3.2. Train the Model till It's Perfect

We now train the model until it's perfect. After every 1000 episodes of training, we test the model against the rule-based AI to see the progress. We'll stop if the model beats the AI 100% of the time. 

In [11]:
episode_count = 0
batches=0
# Train the model
while True:
    if batches%2==0:
        train_player(playing_1)
    else:
        train_player(playing_2)
    # Log details
    episode_count += batch_size
    batches += 1
    # print out progress
    if episode_count % 1000 == 0:
        model.save("files/ch21/alphazero_coin.h5")        
        wins=test()
        print(f"at episode {episode_count}, number of wins is {wins}/100")
        if wins==100:
            break     

at episode 1000, number of wins is 1/100
at episode 2000, number of wins is 6/100
at episode 3000, number of wins is 0/100
at episode 4000, number of wins is 0/100
at episode 5000, number of wins is 47/100
at episode 6000, number of wins is 50/100
at episode 7000, number of wins is 100/100


The model is successfully trained after 7000 episodes. 

# 4. Play The Coin Game with the Trained Model

We'll use the trained model to play games. We'll use a stochastic policy first: select the moves according to the probabilities produced by the policy network. Later we'll try a deterministic policy as well.

## 4.1. When the Alpha Zero Agent Moves First
The Alpha Zero agent chooses the next action based on the probability distribution recommended by the policy network from the trained model. Below, the Alpha Zero agent moves first and plays against an opponent that makes random moves.

In [12]:
results=[]
for i in range(100):
    state=env.reset()     
    while True:    
        action = ACplayer(env)
        state, reward, done, info = env.step(action)
        if done:
            results.append(1)
            break
        action=random.choice(env.validinputs) 
        state, reward, done, info = env.step(action)
        if done:
            results.append(-1)            
            break            

Note that every time the Alpha Zero agent wins, we add 1 to the list *results*. If the Alpha Zero agent loses, we add -1 to the list *results*. We can count how many times the Alpha Zero agent has won and lost:

In [13]:
# Print out the number of games that Alpha Zero agent won
wins=results.count(1)
print(f"The Alpha Zero agent has won {wins} games")
# Print out the number of games that Alpha Zero agent lost
losses=results.count(-1)
print(f"The Alpha Zero agent has lost {losses} games")  

The Alpha Zero agent has won 96 games
The Alpha Zero agent has lost 4 games


The Alpha Zero agent has won 96 out of 100 games. Next, we'll test how the Alpha Zero agent fares when it moves second.

## 4.2. When the Alpha Zero Agent Moves Second
The Alpha Zero agent again chooses the next action based on the probability distribution recommended by the policy network from the trained model. We let the Alpha Zero agent move second while the rule-based AI moves first. 

In [14]:
results=[]

for i in range(100):
    state=env.reset()     
    while True:    
        action = rule_based_AI(state)
        state, reward, done, info = env.step(action)
        if done:
            results.append(-1)
            break
        # select action based on policy network
        action= ACplayer(env)
        state, reward, done, info = env.step(action)
        if done:
            results.append(1)            
            break            

As before, every time the Alpha Zero agent wins, we add 1 to the list *results*. If the Alpha Zero agent loses, we add -1 to the list *results*. We can count how many times the Alpha Zero agent has won and lost:

In [15]:
# Print out the number of games that Alpha Zero agent won
wins=results.count(1)
print(f"The Alpha Zero agent has won {wins} games")
# Print out the number of games that Alpha Zero agent lost
losses=results.count(-1)
print(f"The Alpha Zero agent has lost {losses} games")   

The Alpha Zero agent has won 92 games
The Alpha Zero agent has lost 8 games


## 4.3. A Deterministic Game Strategy
We define an AlphaZeroDeterministic() function that produces the best move based on the trained AC model. The best move is the next move with the highest probability. 

In [16]:
def AlphaZeroDeterministic(env): 
    state = env.state
    onehot_state = onehot_encoder(state)
    action_probs, _ = model(onehot_state)
    idx=np.argmax(np.squeeze(action_probs))
    return idx+1
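
The difference between the stochastic and deterministic policies can be illustrated with a small simulation. The sketch below uses made-up policy probabilities and a hypothetical number of decisions per game, and it treats every decision as having the same probabilities, which is a simplification; it is not a measurement of the trained model.

```python
import random

random.seed(0)
probs = [0.02, 0.98]        # made-up policy output: move 2 is the winning move
best_move = 2
decisions_per_game = 7      # hypothetical number of decisions in one game

def stochastic_choice():
    # sample a move according to the probabilities
    return random.choices([1, 2], weights=probs)[0]

def deterministic_choice():
    # always pick the move with the highest probability
    return probs.index(max(probs)) + 1

games = 10000
stochastic_perfect = sum(
    all(stochastic_choice() == best_move for _ in range(decisions_per_game))
    for _ in range(games)
)
deterministic_perfect = sum(
    all(deterministic_choice() == best_move for _ in range(decisions_per_game))
    for _ in range(games)
)
print(stochastic_perfect / games, deterministic_perfect / games)
```

Even when the policy puts 98% of the probability on the right move, sampling slips at least once in a noticeable share of games, while taking the most likely move never does. Against an opponent that punishes every mistake, this is enough to separate a near-perfect record from a perfect one.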

We play 100 games and let the rule-based AI move first. The Alpha Zero agent moves second.

In [17]:
results=[]

for i in range(100):
    state=env.reset()     
    while True:    
        action = rule_based_AI(state)
        state, reward, done, info = env.step(action)
        if done:
            results.append(-1)
            break
        # select action based on policy network
        action=AlphaZeroDeterministic(env)
        state, reward, done, info = env.step(action)
        if done:
            results.append(1)            
            break            

Every time the Alpha Zero agent wins, we add 1 to the list *results*. If the Alpha Zero agent loses, we add -1 to the list *results*. We can count how many times the Alpha Zero agent has won and lost:

In [18]:
# Print out the number of games that Alpha Zero agent won
wins=results.count(1)
print(f"The Alpha Zero agent has won {wins} games")
# Print out the number of games that Alpha Zero agent lost
losses=results.count(-1)
print(f"The Alpha Zero agent has lost {losses} games")   

The Alpha Zero agent has won 100 games
The Alpha Zero agent has lost 0 games


The Alpha Zero agent has won all 100 games. With the deterministic policy, the Alpha Zero game strategy plays the coin game perfectly as the second player. 