# Chapter 17: Apply the Actor-Critic Method to Connect Four

In Chapter 16, we applied the actor-critic method to Tic Tac Toe. Specifically, you learned how to handle illegal moves and how to use vectorization to speed up training. After roughly ten hours of training, the actor-critic agent was much better than a rule-based AI that thinks three steps ahead.

In this chapter, you'll apply similar techniques to Connect Four. You'll handle illegal moves the same way you did in Chapter 16: by assigning a reward of -1 every time the agent makes an illegal move. You'll also use vectorization to speed up training.

You'll create a single model to train both the red player and the yellow player. Training proceeds as follows. In the first ten games, the red player selects moves based on the recommendations of the policy network, and a think-three-steps-ahead AI plays yellow. We then collect the natural logarithms of the predicted probabilities, the discounted rewards, and the expected game outcomes from the value network, and use them to update the model parameters (that is, to train the model). In the next ten games, the yellow player selects moves based on the policy network and plays against a think-three-steps-ahead AI that moves first; we collect the same information and update the model parameters again. We keep alternating between training the red player and training the yellow player every ten games, and we stop once the agent's average reward exceeds 0.5, the threshold used in the training code in Section 2.2.

***
$\mathbf{\text{Create a subfolder for files in Chapter 17}}$<br>
***
We'll put all files in Chapter 17 in a subfolder /files/ch17. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch17", exist_ok=True)

# 1. The Actor-Critic Method in Connect Four

You'll learn how to play Connect Four using the actor-critic approach. For the moment, we'll just train an actor-critic (AC) model by simulating games with the AC agent moving first and the think-three-steps-ahead AI moving second. Later we'll use the same model to simulate games with the AC agent moving second.

## 1.1. A Model for Connect Four
To save time, we'll create a model with a single input layer that takes 42 values. We won't use a convolutional layer to extract spatial features from the Connect Four board, because doing so is extremely costly computationally. Interested readers with access to supercomputing facilities can try this by mimicking what we did in Chapter 16 for Tic Tac Toe, or look at Chapter 22, in which we create an actor-critic model for Connect Four with a convolutional input layer.

Specifically, the model we use in this chapter is below:

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_inputs = 42
num_actions = 7
# input layer 
inputs = layers.Input(shape=(num_inputs,))
# common layer 
common = layers.Dense(128, activation="relu")(inputs)
# policy network
action = layers.Dense(num_actions, activation="softmax")(common)
# value network
critic = layers.Dense(1, activation="tanh")(common)
model = keras.Model(inputs=inputs, outputs=[action, critic])



The model has one input layer with 42 values in it (that is, we'll flatten the Connect Four game board into a one-dimensional vector and feed it into the model). As in Chapter 16, the model has two outputs: a policy network and a value network, which share the common hidden layer.
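To make the input format concrete, here is a minimal sketch in plain Python of flattening a 6-by-7 board into 42 values. The helper name `flatten_board`, the row-by-row ordering, and the choice of index 5 as the bottom row are assumptions for illustration only; the actual layout is whatever the `conn` environment's state array produces when it is reshaped to `(-1, 42)`.

```python
ROWS, COLS = 6, 7  # a standard Connect Four board

def flatten_board(board):
    """Flatten a ROWS x COLS nested list into one list of 42 values, row by row."""
    return [cell for row in board for cell in row]

# Encoding used in this chapter: 1 = red disc, -1 = yellow disc, 0 = empty
board = [[0] * COLS for _ in range(ROWS)]
board[5][3] = 1    # red disc in the middle column (row 5 assumed to be the bottom)
board[5][4] = -1   # yellow disc in the next column

flat = flatten_board(board)
print(len(flat))                               # 42
print(flat[5 * COLS + 3], flat[5 * COLS + 4])  # 1 -1
```

Because the dense layer sees only this flat vector, it has no built-in notion of which cells are neighbors; that is the spatial information a convolutional layer would otherwise capture.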

Below, we specify the optimizer and the loss function we use to train the model. 

In [3]:
optimizer = keras.optimizers.Adam(learning_rate=0.0005)
loss_func = keras.losses.MeanAbsoluteError()

The optimizer is Adam with a learning rate of 0.0005. The loss function for the value network is the mean absolute error, which penalizes outliers less heavily than the mean squared error does.
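How differently these losses treat large errors can be checked with a few lines of plain Python. The sketch below is illustrative only: the helper names are made up, the losses are computed for a single error rather than a batch, and the Huber threshold of 1.0 is assumed (it is the Keras default for `delta`).

```python
def abs_loss(e):
    """Absolute error: grows linearly with the size of the error."""
    return abs(e)

def sq_loss(e):
    """Squared error: grows quadratically with the size of the error."""
    return e * e

def huber_loss(e, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta."""
    a = abs(e)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

for e in (0.5, 2.0, 4.0):
    print(e, abs_loss(e), sq_loss(e), huber_loss(e))
```

Doubling an error from 2 to 4 doubles the absolute error but quadruples the squared error, which is why a few badly mispredicted game outcomes dominate a squared-error loss far more than an absolute-error loss.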

## 1.2. Simulate Connect Four Games
We'll let the agent play against the look-three-steps-ahead AI that we developed in Chapter 3. Below, we import the red_think3() and yellow_think3() functions from the local module conn_think3. 

In [4]:
from utils.conn_think3 import red_think3,yellow_think3

Next, we'll define a playing_red() function. The playing_red() function simulates a full game, with the rule-based AI as the second player, and the actor-critic (AC) agent as the first player, as follows:

In [5]:
import numpy as np
from utils.conn_simple_env import conn

# allow a maximum of 50 steps per game
max_steps=50
env=conn()
def playing_red():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    for step in range(max_steps):
        state = state.reshape(-1,42,)
        # estimate action probabilities and future rewards
        action_probs, critic_value = model(state)
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # select action based on the policy network
        action=np.random.choice(num_actions,\
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(\
                    tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
        # otherwise, place the move on the game board
        else:              
        # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(reward)
                episode_reward += reward 
                break
            else:
                state, reward, done, _ = env.step(\
                                  yellow_think3(env))
                rewards_history.append(reward)
                wrongmoves_history.append(0)
                episode_reward += reward                 
                if done:
                    break                
    return action_probs_history,critic_value_history, \
            wrongmoves_history,rewards_history, episode_reward

The playing_red() function simulates a full game and records all the intermediate steps made by the AC agent. The function returns five values: 
* a list action_probs_history with the natural logarithm of the probability that the policy network assigned to each action taken by the agent; 
* a list critic_value_history with the estimated future rewards from the value network; 
* a list wrongmoves_history with the penalty for each action: -1 if the move was illegal and 0 otherwise;
* a list rewards_history with the game reward for each action taken by the agent (0 for an illegal move);
* a number episode_reward showing the total reward to the agent during the game. 
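The bookkeeping for a single agent move can be sketched without TensorFlow. In the sketch below, `record_move` is a hypothetical helper, the probabilities are made-up numbers standing in for the policy network's output, and `valid_inputs` stands in for `env.validinputs`. It only mirrors how the histories are filled when a move is sampled; it does not play the game.

```python
import math
import random

def record_move(action_probs, valid_inputs, log_probs, rewards, wrong_moves, rng):
    """Sample a column, log its probability, and flag an illegal choice."""
    # sample an index 0..6 according to the policy probabilities
    action = rng.choices(range(len(action_probs)), weights=action_probs)[0]
    log_probs.append(math.log(action_probs[action]))
    legal = (action + 1) in valid_inputs   # columns are numbered 1..7
    if not legal:
        rewards.append(0)       # no game reward for an illegal move
        wrong_moves.append(-1)  # the penalty is stored in a separate list
    return action + 1, legal

rng = random.Random(0)
log_probs, rewards, wrong_moves = [], [], []
probs = [0.05, 0.05, 0.1, 0.5, 0.1, 0.1, 0.1]
# pretend column 4 is full, so it is no longer a valid input
column, legal = record_move(probs, [1, 2, 3, 5, 6, 7],
                            log_probs, rewards, wrong_moves, rng)
print(column, legal, rewards, wrong_moves)
```

For a legal move the helper appends nothing to the reward lists, because in the chapter's code the reward for a legal move is known only after the environment (and possibly the opponent) has moved.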

In reinforcement learning, actions affect not only current-period rewards but also future rewards. This is the credit assignment problem, which is at the heart of every reinforcement learning algorithm. The solution is to give credit to previous moves by using discounted rewards. We therefore use discounted rewards to assign credit properly, as follows:

In [6]:
import numpy as np

def discount_rs(r,wrong):
    discounted_rs = np.zeros(len(r))
    running_add = 0
    for i in reversed(range(0, len(r))):
        if wrong[i]==0:  
            running_add = gamma*running_add + r[i]
            discounted_rs[i] = running_add  
    return discounted_rs.tolist()

The function takes two lists, *r* and *wrong*, as inputs, where *r* holds the rewards related to legal moves and *wrong* holds the rewards related to illegal moves. We discount *r* but not *wrong*. The output is the properly discounted reward in each time period in which a legal move is made. Note that if an illegal move is made in a time period, the output reward for that period is 0. We'll add the reward of -1 for illegal moves later, during training. The discount rate *gamma* is a global variable that we set in Section 2.2, before the function is first called.
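A small worked example may help. The sketch below re-implements the same logic in plain Python, with the discount rate passed as a parameter instead of read from a global variable (a change made here only to keep the example self-contained); the function name `discount_rewards` and the sample game are made up for illustration.

```python
def discount_rewards(rewards, wrong, gamma=0.95):
    """Discount rewards backward in time, skipping steps flagged as illegal moves."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        if wrong[i] == 0:
            running = gamma * running + rewards[i]
            discounted[i] = running
    return discounted

rewards = [0, 0, 0, 1]    # the agent wins on its last move
wrong = [0, -1, 0, 0]     # its second attempt was an illegal move

returns = discount_rewards(rewards, wrong)
# add the undiscounted penalty for illegal moves, as done at training time
combined = [d + w for d, w in zip(returns, wrong)]
print([round(x, 4) for x in combined])   # [0.9025, -1.0, 0.95, 1.0]
```

The illegal attempt does not consume a discount step: the first move receives 0.95 squared, not 0.95 cubed, because the illegal attempt did not change the board. The illegal attempt itself ends up with exactly -1.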

Similarly, to train the yellow player, we define a *playing_yellow()* function. The function simulates a full game, with the rule-based AI as the first player, and the actor-critic (AC) agent as the second player, as follows:

In [7]:
def playing_yellow():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    state, reward, done, _ = env.step(red_think3(env))
    for step in range(max_steps):
        state = state.reshape(-1,42,)
        # estimate action probabilities and future rewards
        action_probs, critic_value = model(-state)
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # select action based on the policy network
        action=np.random.choice(num_actions, \
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(tf.math.log(\
                                action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
        # otherwise, place the move on the game board
        else:              
        # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(-reward)
                episode_reward += -reward 
                break
            else:
                state, reward, done, _ = env.step(red_think3(env))
                rewards_history.append(-reward)
                wrongmoves_history.append(0)
                episode_reward += -reward                 
                if done:
                    break                
    return action_probs_history,critic_value_history, \
            wrongmoves_history,rewards_history, episode_reward

The *playing_yellow()* function simulates a full game and records all the intermediate steps made by the AC agent. It returns the same five values as *playing_red()*.

Most importantly, we use *-state* instead of *state* when we feed the game board to the model. This is because the game board is coded as 1 if a cell has a red disc in it, -1 if a cell has a yellow disc in it, and 0 if the cell is empty. To use the same network for both players, we multiply the game board by -1 so that the model treats 1 as a cell occupied by the current player and -1 as a cell occupied by the opponent. 

The second most important thing is to change the rewards. Instead of using 1 and -1 to denote a win by the red and yellow player, respectively, we use a reward of 1 to denote that the current player has won and a reward of -1 to denote that the current player has lost. We accomplish this by using *-reward* instead of *reward* in the above code cell. Therefore, from the yellow player's perspective, the reward is now 1 if the yellow player wins and -1 if the yellow player loses.
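Both sign flips can be illustrated with a short sketch in plain Python. The helper names `to_current_player_view` and `to_current_player_reward` are hypothetical, and the three-cell "board" is a toy stand-in for the 42-value state.

```python
def to_current_player_view(board, turn):
    """Return the board with +1 for the player to move and -1 for the opponent."""
    sign = 1 if turn == "red" else -1
    return [sign * cell for cell in board]

def to_current_player_reward(reward, turn):
    """Convert the environment's reward (+1 red wins, -1 yellow wins) to the agent's view."""
    sign = 1 if turn == "red" else -1
    return sign * reward

cells = [1, -1, 0]   # red disc, yellow disc, empty cell
print(to_current_player_view(cells, "red"))      # [1, -1, 0]
print(to_current_player_view(cells, "yellow"))   # [-1, 1, 0]
print(to_current_player_reward(-1, "yellow"))    # 1
```

With this convention the network always sees "my discs" as 1 and always receives 1 for "I won", so games played as red and games played as yellow produce consistent training signals.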

# 2. Train the Red and Yellow Players
In this section, we'll train the red and yellow players using the same model. 

## 2.1. Create Batches for Training
Instead of updating model parameters after one episode, we update after a certain number of episodes to make the model stable. Here we update parameters every ten games when training the red or the yellow player, as follows:

In [8]:
batch_size=10
def create_batch(playing_function):
    action_probs_history = []
    critic_value_history = []
    wrongmoves_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,cvs,wms,rs,er = playing_function()
        # rewards are discounted
        returns = discount_rs(rs,wms)
        action_probs_history += aps
        critic_value_history += cvs
        # punishments for wrong moves are not discounted
        wrongmoves_history += wms
        # combined discounted rewards with punishments
        combined=np.array(returns)+np.array(wms)
        # add combined rewards to rewards history
        rewards_history += combined.tolist()
        episode_rewards.append(er)        
    return action_probs_history,critic_value_history,\
        rewards_history,episode_rewards

## 2.2. Train the Model for Both Players

To alternate between training the red player and training the yellow player, we'll create a variable *batches*. It starts with a value of 0. After each batch of training, we add 1 to the value of the variable *batches*. We'll train the red player when the value of *batches* is even and train the yellow player otherwise. 

We train the model until the average episode reward to the AC agent exceeds 0.5 over the most recent 10,000 games. 
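The alternation rule and the running average can be sketched in isolation with the standard library. The helper `player_for_batch` is hypothetical, and the window of 4 rewards is a toy stand-in for the 10,000-game window.

```python
from collections import deque

def player_for_batch(batches):
    """Even-numbered batches train red, odd-numbered batches train yellow."""
    return "red" if batches % 2 == 0 else "yellow"

print([player_for_batch(b) for b in range(4)])
# ['red', 'yellow', 'red', 'yellow']

# A deque with maxlen keeps only the most recent entries, so its mean is a
# running average over a fixed window.
window = deque(maxlen=4)
for reward in [-1, -1, 1, 1, 1, 1]:
    window.append(reward)
running_reward = sum(window) / len(window)
print(list(window), running_reward)   # [1, 1, 1, 1] 1.0
```

Because old games fall out of the window automatically, early losses stop dragging the average down once enough recent games have been played.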

In [9]:
from collections import deque

running_rewards=deque(maxlen=10000)
episode_count = 0
batches=0
gamma=0.95
# Train the model
while True:
    with tf.GradientTape() as tape:
        if batches%2==0:
            action_probs_history,critic_value_history,\
            rewards_history,episode_rewards=create_batch(playing_red)
        else:
            action_probs_history,critic_value_history,\
            rewards_history,episode_rewards=create_batch(playing_yellow)                                
        # Calculating loss values to update our network        
        tfdif=tf.convert_to_tensor(rewards_history,dtype=tf.float32)-\
            tf.convert_to_tensor(critic_value_history,dtype=tf.float32)
        alosses=-tf.multiply(tf.convert_to_tensor(action_probs_history,\
                            dtype=tf.float32),tfdif)
        closs=loss_func(tf.convert_to_tensor(rewards_history,dtype=tf.float32),\
             tf.convert_to_tensor(critic_value_history,dtype=tf.float32))
        # Backpropagation
        loss_value = tf.reduce_sum(alosses) + closs
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Log details
    episode_count += batch_size
    batches += 1
    for r in episode_rewards:
        running_rewards.append(r)
    running_reward=np.mean(np.array(running_rewards)) 
    # print out progress
    if episode_count % 1000 == 0:
        template = "running reward: {:.6f} at episode {}"
        print(template.format(running_reward, episode_count)) 
                
    # Stop if the game is solved
    if running_reward > 0.5 and episode_count>100:  
        print("Solved at episode {}!".format(episode_count))
        break 
model.save("files/ch17/ac_conn.h5")        

It takes about 24 hours to train the model. Once done, the model is saved as ac_conn.h5 on your computer. 
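To see what the loss in the training loop measures, the sketch below recomputes it in plain Python for a two-step toy example. The function name `actor_critic_loss` and all the numbers are made up for illustration; the real loop does the same arithmetic on tensors so that gradients can flow back to the model.

```python
import math

def actor_critic_loss(log_probs, returns, values):
    """Sum of policy-gradient terms plus a mean-absolute-error critic term."""
    # how much better (or worse) the outcome was than the critic expected
    diffs = [g - v for g, v in zip(returns, values)]
    # actor terms: -log(probability of the chosen action) * difference
    actor_terms = [-lp * d for lp, d in zip(log_probs, diffs)]
    # critic term: mean absolute error between returns and predicted values
    critic_loss = sum(abs(d) for d in diffs) / len(diffs)
    return sum(actor_terms) + critic_loss, actor_terms, critic_loss

log_probs = [math.log(0.5), math.log(0.25)]  # log-probabilities of the chosen moves
returns = [1.0, -1.0]    # a good outcome, then an illegal move penalized with -1
values = [0.5, 0.0]      # what the value network predicted

total, actor_terms, critic_loss = actor_critic_loss(log_probs, returns, values)
print(actor_terms, critic_loss, total)
```

The first actor term is positive, so minimizing the loss raises the probability of that move; the second is negative, so minimizing the loss pushes the probability of the penalized move down.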

# 3. Play Connect Four with the Trained Model

We'll use the trained model to play against the think-three-steps-ahead rule-based AI. We'll use the stochastic policy only: we select moves according to the probabilities produced by the policy network. Interested readers can try the deterministic policy themselves; the results are similar. 

## 3.1. Define An AC Player

We define an ACplayer() function that selects a move based on the trained AC model. 

In [10]:
model=keras.models.load_model("files/ch17/ac_conn.h5")

def ACplayer(env): 
    state = env.state.reshape(-1,42,)
    if env.turn=="red":
        action_probs, critic_value = model(state)
    else:
        action_probs, critic_value = model(-state)        
    aps=[]
    for a in sorted(env.validinputs):
        aps.append(np.squeeze(action_probs)[a-1])
    ps=np.array(aps)/np.array(aps).sum()
    return np.random.choice(sorted(env.validinputs),p=ps)



We feed the current game board to the model. It's important that if it's the yellow player's turn, we multiply the game board by -1 before feeding it into the model. We look at the output from the policy network of the trained model and focus on valid moves. We normalize the probability distribution over all valid moves so that the probabilities sum to one. The agent takes actions based on this probability distribution. 
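The renormalization step can be sketched on its own in plain Python. The helper name `masked_policy` and the probabilities are made up for illustration; `valid_inputs` stands in for `env.validinputs`, with columns numbered from 1 to 7.

```python
import random

def masked_policy(action_probs, valid_inputs):
    """Keep only the valid columns and rescale their probabilities to sum to one."""
    cols = sorted(valid_inputs)
    kept = [action_probs[c - 1] for c in cols]   # column c is index c-1
    total = sum(kept)
    return cols, [p / total for p in kept]

probs = [0.1, 0.2, 0.1, 0.4, 0.1, 0.05, 0.05]
# pretend column 4 is full, so it is not among the valid inputs
cols, ps = masked_policy(probs, [1, 2, 3, 5, 6, 7])
move = random.choices(cols, weights=ps)[0]
print(cols, [round(p, 3) for p in ps], move)
```

Dropping the full column removes 0.4 of the probability mass, and dividing by the remaining 0.6 restores a proper distribution, so the agent can never sample an illegal move at play time.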

## 3.2. When the AC Player Moves First
The AC player chooses the next action based on the probability distribution recommended by the policy network of the trained model. The opponent is the think-three-steps-ahead rule-based AI we developed in Chapter 3. We play 100 games and let the AC player move first and the rule-based AI move second.

We create a list *results*. In each game, the AC agent moves first and the rule-based AI moves second. If the AC agent wins, we add an outcome 1 to *results*. If the rule-based AI wins, we add an outcome of -1 to *results*. If the game is tied, we add a 0 to *results*.

In [11]:
results=[]
for i in range(100):
    env=conn()
    state=env.reset()     
    while True:    
        action=ACplayer(env)
        state, reward, done, info = env.step(action)
        if done:
            results.append(1)
            break
        # the rule-based AI selects its move
        action=yellow_think3(env) 
        state, reward, done, info = env.step(action)
        if done:
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)
            break            

We can count how many times the AC agent has won and how many times the AC agent has lost, like this:

In [12]:
# Print out the number of games that AC won
wins=results.count(1)
print(f"The AC player has won {wins} games")
# Print out the number of games that AC lost
losses=results.count(-1)
print(f"The AC player has lost {losses} games")              
# Print out the number of tie games
ties=results.count(0)
print(f"There are {ties} tied games") 

The AC player has won 84 games
The AC player has lost 16 games
There are 0 tied games


The AC agent has won 84 games and lost 16 games. None of the games is tied. The AC agent is clearly better than the think-three-steps-ahead AI when it moves first. Next, we'll test how the AC agent fares when it moves second.

## 3.3. When the AC Player Moves Second
Next, we test if the AC agent can beat the think-three-steps-ahead rule-based AI when it moves second. We again create an empty list *results*. We simulate 100 games. In each game, the rule-based AI moves first and the AC agent moves second. If the AC agent wins, we add an outcome 1 to *results*. If the rule-based AI wins, we add an outcome of -1 to *results*. If the game is tied, we add a 0 to *results*.

In [13]:
results=[]
for i in range(100):
    env=conn()
    state=env.reset()     
    while True:    
        action=red_think3(env)
        state, reward, done, info = env.step(action)
        if done:
            results.append(-1)
            break
        # select action based on policy network
        action=ACplayer(env)
        state, reward, done, info = env.step(action)
        if done:
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)
            break          

We look at the game outcomes below:

In [14]:
# Print out the number of games that AC won
wins=results.count(1)
print(f"The AC player has won {wins} games")
# Print out the number of games that AC lost
losses=results.count(-1)
print(f"The AC player has lost {losses} games")              
# Print out the number of tie games
ties=results.count(0)
print(f"There are {ties} tied games")  

The AC player has won 74 games
The AC player has lost 26 games
There are 0 tied games


The AC agent has won 74 games and lost 26 games. Taken together, the AC agent is clearly much better than the think-three-steps-ahead AI player. 
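The two experiments can be combined into a single win rate with a few lines of Python. The counts below are the ones reported above; the dictionary layout is just one convenient way to hold them.

```python
results = {
    "moves first": {"win": 84, "loss": 16, "tie": 0},
    "moves second": {"win": 74, "loss": 26, "tie": 0},
}

games = sum(sum(r.values()) for r in results.values())
wins = sum(r["win"] for r in results.values())
print(f"{wins} wins in {games} games ({wins / games:.0%})")
# 158 wins in 200 games (79%)
```

Since each experiment has only 100 games and the policy is stochastic, rerunning the cells will produce somewhat different counts.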