# Chapter 16: An Actor-Critic Agent in Tic Tac Toe

We discussed how the actor-critic method works in Chapter 15 and applied it to the coin game. In this chapter, we'll apply the actor-critic method to Tic Tac Toe. We'll address four issues that arise in a moderately complicated game such as Tic Tac Toe: first, how to handle illegal moves; second, how to add a convolutional layer to the model so that we can use the spatial features of the game board to train game strategies; third, how to use vectorization to speed up training; fourth, how to encode the board when the game state is player-dependent.

To teach the model not to play illegal moves, we'll assign a negative reward each time the agent makes an illegal move. However, we need to address the credit assignment problem when assigning negative rewards: only the illegal move itself should be punished, not the moves before it. This is different from a sequence of moves that leads to a win or a loss for the agent, where the earlier moves share the credit or the blame.

To use the spatial features of the game board, we add a convolutional layer as an input. As a result, the model now has two input networks: one dense network and one convolutional network. The model also has two output networks: a policy network (the actor) and a value network (the critic).

Training such a complicated network is computationally costly. To speed up training, you'll learn to use vectorization to make the program more efficient.

In the Coin Game, the game state is player-independent: if four coins are left in the pile, for example, the player whose turn it is will use the same strategy whether that player is Player 1 or Player 2. However, in Tic Tac Toe and Connect Four, the game state is player-dependent since whether Player 1 or Player 2 wins depends on whose game pieces occupy which cells on the board. Therefore, in order to use the same network to train both players, we need to multiply the board by -1 when it's Player 2's turn. Specifically, we'll create one single model for Player X and Player O in Tic Tac Toe. Since a cell is represented by 1 if it's occupied by Player X, -1 if occupied by Player O, and 0 if empty, we'll multiply the game board by -1 if and only if it's Player O's turn. This way, when the board is fed into the deep neural network, it's encoded consistently: 1 indicates a cell occupied by the current player, -1 a cell occupied by the opponent, and 0 an empty cell. By making such a change, the same model can be used to predict moves for both Player X and Player O.
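
The sign flip described above can be sketched in a few lines of NumPy. The helper name `encode_board` and the string labels "X" and "O" for the turn are assumptions made for this illustration only.

```python
import numpy as np

def encode_board(board, turn):
    """Return the board from the current player's point of view.

    board: array with 1 for X, -1 for O, and 0 for empty cells.
    turn: "X" or "O" (hypothetical labels for whose move it is).
    """
    board = np.asarray(board)
    # Flip the signs only when it is Player O's turn
    return board if turn == "X" else -board

# X occupies cells 1 and 5; O occupies cell 2
board = np.array([1, -1, 0, 0, 1, 0, 0, 0, 0])
print(encode_board(board, "X"))
print(encode_board(board, "O"))
```

In both calls, a value of 1 marks a cell held by the player about to move, so a single network can serve both players.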

***
$\mathbf{\text{Create a subfolder for files in Chapter 16}}$<br>
***
We'll put all files in Chapter 16 in a subfolder /files/ch16. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch16", exist_ok=True)

# 1. How to Handle Illegal Moves?
In the coin game we discussed in Chapter 15, the AC agent makes one of the two moves each step: taking one or two coins from the pile. The number of legal moves is the same in every step: two. This is not the case in more complicated games such as Tic Tac Toe or Connect Four. 

In Tic Tac Toe, the first player has nine legal moves in the first step: cells 1 through 9. After that, it's Player O's turn and there are only eight legal moves left: if Player X has taken cell 1 in the first move, for example, cell 1 is no longer a legal move for Player O. How do we train the model so that the agent avoids illegal moves? The obvious answer is to assign a negative reward each time the agent makes an illegal move. However, this relates to the credit assignment problem in reinforcement learning, and we need to handle the negative rewards with a little more care.

## 1.1. The Credit Assignment Problem in Reinforcement Learning
In reinforcement learning, agents learn the best actions through feedback from rewards (which can be either positive or negative). However, rewards are sparse and delayed, and the agent needs to figure out how to assign proper credit to a sequence of actions that leads to a good or a bad outcome. Discounting rewards is the usual solution.

To illustrate the point, let's assume that in a Tic Tac Toe game, the sequence of moves is the following:
* Player X takes cell 1;
* Player O takes cell 2;
* Player X takes cell 5;
* Player O takes cell 3;
* Player X takes cell 9.

At this point, Player X has connected three in a row diagonally in cells 1, 5, and 9. The undiscounted rewards for the three steps made by Player X are 0, 0, and 1, respectively. However, the third step alone didn't win the game, so we should give credit to the first two moves as well by discounting rewards. Assuming the discount rate is 0.9, the discounted rewards for the three steps made by Player X are 0.81, 0.9, and 1, respectively.
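
The arithmetic above can be reproduced with a short backward pass over the rewards. This is a minimal sketch; the function name `discounted_rewards` is an assumption for this example.

```python
def discounted_rewards(rewards, gamma=0.9):
    """Discount a list of per-step rewards, working backward."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        # Each step gets its own reward plus the discounted future total
        running = gamma * running + rewards[i]
        discounted[i] = running
    return discounted

# Player X's three moves (cells 1, 5, and 9) earn 0, 0, and 1
print(discounted_rewards([0, 0, 1]))
```

Up to floating-point rounding, the result is 0.81, 0.9, and 1, matching the numbers in the text.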

## 1.2. The Credit Assignment Problem in Illegal Moves
To train the AC agent to avoid illegal moves, we can assign negative rewards to the move each time the agent makes an illegal move. However, we should make sure that only the illegal move gets the blame, not the moves before it. 

To make the point clear, let's consider two different cases. In case 1, the following sequence of moves is made by the two players:
* Player X takes cell 1;
* Player O takes cell 5;
* Player X takes cell 2;
* Player O takes cell 3;
* Player X takes cell 4;
* Player O takes cell 7.

At this point, Player O has connected three in a row diagonally in cells 3, 5, and 7. The rewards for the three steps made by Player X are 0, 0, and -1, respectively. Assuming a discount rate of 0.9, the discounted rewards for the three steps made by Player X are -0.81, -0.9, and -1, respectively. This is appropriate because the third step alone didn't lose the game for Player X: the first two moves should be punished as well.

Now let's consider a separate situation: 
* Player X takes cell 1;
* Player O takes cell 7;
* Player X takes cell 5;
* Player O takes cell 9;
* Player X takes cell 7;
* Player X takes cell 6;
* Player O takes cell 8.

Player X made an illegal move in cell 7 since cell 7 was already taken by Player O, so we should assign a reward of -1 to that move. However, the two previous moves made by Player X (1 and 5) should not be punished for the illegal move. We should treat the illegal move by X (7) and the three bad moves (1, 5, and 6) that led to a loss differently, as follows:
* Move 7 should be assigned a reward of -1; this reward is final and should not be discounted. 
* Moves 1, 5, and 6 should be assigned undiscounted rewards of 0, 0, and -1, respectively. Assuming the discount rate is 0.9, the discounted rewards for the three steps made by Player X (1, 5, and 6) should be -0.81, -0.9, and -1, respectively. 

To summarize, we should assign a negative reward each time the AC agent makes an illegal move, but we don't discount this reward; further, other moves made by the same player should not be punished for it.

# 2. Use Vectorization to Speed Up Training
Training deep neural networks can be extremely computationally costly. For example, to train AlphaGo to play against Lee Sedol, DeepMind used 1920 CPUs and 280 GPUs, according to an article in The Economist (Showdown, The Economist, March 12, 2016). 

Our games are not as complicated as the Go game. However, training these simple games can still be computationally costly without the help of supercomputing facilities. Therefore, making our code more efficient will certainly reduce the computational cost. In particular, vectorization makes computations more efficient. Below we use an example to show how vectorization saves time. Assume we want to add up all natural numbers between 1 and 10,000,000. You can do this using a loop as follows:

In [2]:
import time

start=time.time()
total=0
for i in range(1,10000001):
    total += i
print(f"the sum is {total}")  
end=time.time()
print(f"the calculation took {end-start} seconds to finish")  

the sum is 50000005000000
the calculation took 0.9935817718505859 seconds to finish


The calculation took roughly 1 second to finish. Note that the number you get is likely to be different since the speed depends on the hardware configurations of your computer. 

Now, let's use vectorization to do the same calculation and see how long it takes. 

In [3]:
import numpy as np

start=time.time()
constant=np.ones(10000000)
numbers=np.arange(10000000)+1
total=np.matmul(constant,numbers)
print(f"the sum is {total}")  
end=time.time()
print(f"the calculation took {end-start} seconds to finish")  

the sum is 50000005000000.0
the calculation took 0.04994463920593262 seconds to finish


Here we have created two vectors: *constant* is a vector with values 1 everywhere, and *numbers* is a vector with values from 1 to 10,000,000. We use the *matmul()* function in NumPy to multiply the two vectors, which gives their inner product and hence the sum. We get the same answer (printed as a floating-point number), but the calculation took only about 0.05 seconds. In this example, vectorization has saved about 95% of the computational time. 
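
As a side note, the same total can also be obtained with the array's own `sum()` method on an integer array, which avoids building the vector of ones. The sketch below is one possible variant, not the approach used in the rest of the chapter; the function name `vectorized_sum` is an assumption, and the closed-form expression n(n+1)/2 serves only as a check.

```python
import numpy as np

def vectorized_sum(n):
    """Add up the natural numbers 1..n without a Python loop."""
    # int64 avoids overflow on platforms whose default integer is 32-bit
    numbers = np.arange(1, n + 1, dtype=np.int64)
    return int(numbers.sum())

n = 10_000_000
total = vectorized_sum(n)
# Compare against the closed-form formula n(n+1)/2
print(total, total == n * (n + 1) // 2)
```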

We'll use similar techniques when we train game strategies for the rest of the book: we'll use vectors instead of loops to speed up training. 

# 3. Use Actor-Critic to Play Tic Tac Toe

You have learned how to play the Coin Game using the actor-critic approach in Chapter 15. Specifically, you have adjusted the model parameters in proportion to the product of the gradients and the advantage. In this section, we'll apply the same technique to Tic Tac Toe.

## 3.1. Combine Two Networks into One
Below, we'll create two separate input networks first: one is a dense layer with 9 neurons and the other has a convolutional layer that can detect the spatial features on the Tic Tac Toe board. We'll then combine the two networks using the *concat()* function in TensorFlow. We'll use the combined network as the input to our actor-critic model. The model has two output networks, similar to the two output networks we used in Chapter 15: a policy network for the actor and a value network for the critic. 

The code in the cell below creates such a model:

In [4]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_inputs = 9
num_actions = 9
num_hidden = 32
# The convolutional input layer
conv_inputs = layers.Input(shape=(3,3,1))
conv = layers.Conv2D(filters=64, kernel_size=(3,3),padding="same",
     input_shape=(3,3,1), activation="relu")(conv_inputs)
flat = layers.Flatten()(conv)
# The dense input layer
inputs = layers.Input(shape=(9,))
# Combine the two into a single input layer
two_inputs = tf.concat([flat,inputs],axis=1)
common = layers.Dense(128, activation="relu")(two_inputs)
# Policy output network (actor)
action = layers.Dense(32, activation="relu")(common)
action = layers.Dense(num_actions, activation="softmax")(action)
# Value output network (critic)
critic = layers.Dense(32, activation="relu")(common)
critic = layers.Dense(1, activation="tanh")(critic)
# The final model
model = keras.Model(inputs=[inputs, conv_inputs],\
                    outputs=[action, critic])

The model has two input networks. The first one takes in the game board as a three-by-three image and uses a convolutional layer to extract the spatial features on the game board. The output is then flattened into a one-dimensional vector. The second input network is a dense layer with nine neurons. We combine the two input layers by using the *concat()* function from TensorFlow. The combined layer is then fed into the common layer. We also create two output networks, a policy network and a value network. 
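
To build some intuition for what a convolutional filter can pick up on a three-by-three board, here is a minimal NumPy sketch. The filter is hand-picked for illustration (the trained model learns its own 64 filters), and the helper `conv2d_same` is a hypothetical stand-in for a single-filter convolution with `padding="same"`, without bias or activation.

```python
import numpy as np

def conv2d_same(board, kernel):
    """Slide a 3x3 kernel over a zero-padded 3x3 board (cross-correlation)."""
    padded = np.pad(board, 1)  # one ring of zeros keeps the output 3x3
    out = np.zeros(board.shape, dtype=float)
    for i in range(3):
        for j in range(3):
            # Elementwise product of the kernel with the 3x3 window at (i, j)
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

# The current player (1) holds the main diagonal: cells 1, 5, and 9
board = np.array([[1, -1, -1],
                  [0,  1,  0],
                  [0,  0,  1]])
diag_kernel = np.eye(3)  # hand-picked filter that responds to the main diagonal
out = conv2d_same(board, diag_kernel)
print(out[1, 1])
```

The value at the center of the output is the sum of the three diagonal cells, so it reaches 3 exactly when the current player holds the whole diagonal. A dense layer applied to the flat nine-value vector has no such built-in notion of neighboring cells.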

Below, we specify the optimizer and the loss function we use to train the model. 

In [5]:
optimizer=keras.optimizers.Adam(learning_rate=0.0005)
loss_func=keras.losses.MeanAbsoluteError()

The optimizer is Adam with a learning rate of 0.0005. The loss function is the mean absolute error, which punishes outliers less than loss functions such as the mean squared error. 
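
The effect of an outlier on the loss can be seen in a small comparison of mean absolute error with mean squared error. The error values below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical critic errors: two small ones and one outlier
errors = np.array([0.1, 0.1, 2.0])

mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)

# Share of each loss that comes from the outlier alone
mae_share = np.abs(errors[-1]) / np.sum(np.abs(errors))
mse_share = errors[-1] ** 2 / np.sum(errors ** 2)
print(f"MAE = {mae:.3f}, outlier share = {mae_share:.1%}")
print(f"MSE = {mse:.3f}, outlier share = {mse_share:.1%}")
```

Because the squared error grows quadratically, the outlier accounts for a larger share of the mean squared error than of the mean absolute error.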

## 3.2. Simulate Tic Tac Toe Games
We have shown in Chapters 4 to 6 that the minimax algorithm can solve the Tic Tac Toe game so no other algorithm can beat it. We'll train the actor-critic agent by playing against the minimax agent that we developed in Chapter 6. However, we make a couple of minor modifications to the minimax agent to speed up the game: first, whenever there is one move left, the agent takes the move without searching; second, if cell 5 (the middle cell in the middle row) is empty, the player occupies cell 5. By making these two changes, we greatly speed up the game while preserving the effectiveness of the minimax algorithm. 

For that purpose, we define the function *minimax_ttt()* in the local module ch16util. It's the same as the minimax algorithm with alpha-beta pruning we defined in Chapter 6 except the following four lines of code:

```python
    if len(env.validinputs)==1:
        return env.validinputs[0]
    if 5 in env.validinputs:
        return 5
```

Below, we import the function *minimax_ttt()* as follows:

In [6]:
from utils.ch16util import minimax_ttt

To train Player X, we define a *playing_X()* function. We let the minimax agent move second to train the AC agent Player X. The function simulates a full game as follows:

In [7]:
from utils.ttt_simple_env import ttt

# allow a maximum of 50 steps per game
max_steps=50
env=ttt()

def playing_X():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    for step in range(max_steps):
        state = state.reshape(-1,9,)
        conv_state = state.reshape(-1,3,3,1)
        # Predict action probabilities and future rewards
        action_probs, critic_value = model([state,conv_state])
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # select action based on the policy network
        action=np.random.choice(num_actions,\
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(\
                        tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
            #episode_reward += -1 
        # otherwise, place the move on the game board
        else:
            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(reward)
                episode_reward += reward 
                break
            else:
                state,reward,done,_=env.step(minimax_ttt(env))
                rewards_history.append(reward)
                wrongmoves_history.append(0)
                episode_reward += reward                 
                if done:
                    break                
    return action_probs_history,critic_value_history, \
            wrongmoves_history,rewards_history, episode_reward

The playing_X() function simulates a full game and records all the intermediate steps made by the AC agent. The function returns five values, in this order: 
* a list action_probs_history with the natural logarithm of the probability that the policy network assigned to each action taken by the agent; 
* a list critic_value_history with the estimated future rewards from the value network; 
* a list wrongmoves_history with the penalty for each action: -1 for an illegal move and 0 otherwise;
* a list rewards_history with the reward for each action taken by the agent in the game (0 for an illegal move);
* a number episode_reward showing the total rewards to the agent during the game, excluding penalties for illegal moves. 

The playing_X() function records the log probabilities, value estimates, and rewards from the game. Since actions in reinforcement learning affect not only current-period rewards but also future rewards, we use discounted rewards to assign credit properly to legal moves. The following function implements this discounting of rewards:

In [8]:
import numpy as np

def discount_rs(r,wrong):
    discounted_rs = np.zeros(len(r))
    running_add = 0
    for i in reversed(range(0, len(r))):
        if wrong[i]==0:  
            running_add = gamma*running_add + r[i]
            discounted_rs[i] = running_add  
    return discounted_rs.tolist()

The function takes two lists, *r* and *wrong*, as inputs, where *r* contains the rewards related to legal moves while *wrong* contains the rewards related to illegal moves. We discount *r* but not *wrong*. The output is the properly discounted reward in each time period when a legal move is made. Note that if an illegal move is made in a time period, the reward is 0 in the output. We'll add in the reward of -1 for illegal moves later during training. The function reads the discount rate *gamma* as a global variable, which we set in the training cell in Section 3.3 before the function is first called. 
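
To check the logic against the second case in Section 1.2, the sketch below restates the function with the discount rate passed in explicitly. The name `discount_rs_demo` and the explicit `gamma` argument are assumptions made so the example runs on its own; the reward lists encode Player X's four attempts in that game.

```python
import numpy as np

def discount_rs_demo(r, wrong, gamma):
    """Same logic as discount_rs(), with gamma passed in explicitly."""
    discounted_rs = np.zeros(len(r))
    running_add = 0
    for i in reversed(range(len(r))):
        if wrong[i] == 0:  # illegal moves are skipped and keep a value of 0
            running_add = gamma * running_add + r[i]
            discounted_rs[i] = running_add
    return discounted_rs

# Player X's attempts in order: cell 1, cell 5, cell 7 (illegal), cell 6
rs = [0, 0, 0, -1]    # rewards for legal moves
wms = [0, 0, -1, 0]   # penalties for illegal moves
# Add the undiscounted penalties to the discounted rewards
combined = discount_rs_demo(rs, wms, gamma=0.9) + np.array(wms)
print(combined)
```

Up to floating-point rounding, the combined rewards are -0.81, -0.9, -1, and -1: the three legal moves share the discounted blame for the loss, while the illegal move receives its own penalty of -1 and does not pass any of it back to the earlier moves.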

Similarly, to train Player O, we define a *playing_O()* function. The function simulates a full game, with the minimax agent as the first player, and the actor-critic (AC) agent as the second player, as follows:

In [9]:
def playing_O():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    state, reward, done, _ = env.step(minimax_ttt(env))
    for step in range(max_steps):
        state = state.reshape(-1,9,)
        conv_state = state.reshape(-1,3,3,1)
        # Predict action probabilities and future rewards
        action_probs, critic_value = model([-state,-conv_state])
        # record value history
        critic_value_history.append(critic_value[0, 0])
        # select action based on the policy network
        action=np.random.choice(num_actions,\
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(\
                        tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
            #episode_reward += -1 
        # otherwise, place the move on the game board
        else:
            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(-reward)
                episode_reward += -reward 
                break
            else:
                state,reward,done,_=env.step(minimax_ttt(env))
                rewards_history.append(-reward)
                wrongmoves_history.append(0)
                episode_reward += -reward                 
                if done:
                    break                
    return action_probs_history,critic_value_history, \
            wrongmoves_history,rewards_history, episode_reward

The *playing_O()* function simulates a full game and records all the intermediate steps made by the AC agent. The function returns the same five values as the function *playing_X()*.

Most importantly, we use [-state,-conv_state] instead of [state,conv_state] when we feed the game board to the model. This is because the game board has value 1 in a cell if Player X has placed a game piece in it, and -1 if Player O has placed a game piece in it. To use the same network for both players, we multiply the input by -1 so that the model treats 1 as occupied by the current player and -1 as occupied by the opponent. 

The second most important thing is to change the rewards. Instead of using 1 and -1 to denote a win by Player X and Player O, respectively, we'll use a reward of 1 to denote that the current player has won and a reward of -1 to denote that the current player has lost. We accomplish this by using -reward instead of reward in the above cell. Therefore, after Player O makes a move, the reward is now 1 if Player O wins and -1 if Player O loses.

## 3.3. Train Players X and O
Instead of updating model parameters after one episode, we update after a certain number of episodes to make the model stable. Here we update parameters every ten games to train Player X, as follows:

In [10]:
batch_size=10     
def create_batch_X(batch_size):
    action_probs_history = []
    critic_value_history = []
    wrongmoves_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,cvs,wms,rs,er = playing_X()
        # rewards are discounted
        returns = discount_rs(rs,wms)
        action_probs_history += aps
        critic_value_history += cvs
        # punishments for wrong moves are not discounted
        wrongmoves_history += wms
        # combined discounted rewards with punishments
        combined=np.array(returns)+np.array(wms)
        # add combined rewards to rewards history
        rewards_history += combined.tolist()
        episode_rewards.append(er)        
    return action_probs_history,critic_value_history,\
        rewards_history,episode_rewards

In [11]:
def create_batch_O(batch_size):
    action_probs_history = []
    critic_value_history = []
    wrongmoves_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,cvs,wms,rs,er = playing_O()
        # reward related to legal moves are discounted
        returns = discount_rs(rs,wms)
        action_probs_history += aps
        critic_value_history += cvs
        # punishments for wrong moves are not discounted
        wrongmoves_history += wms
        # combined discounted rewards with punishments
        combined=np.array(returns)+np.array(wms)
        # add combined rewards to rewards history
        rewards_history += combined.tolist()
        episode_rewards.append(er)        
    return action_probs_history,critic_value_history,\
        rewards_history,episode_rewards

We'll train the model and update the parameters until the average episode reward to the AC agent over the last 100,000 games exceeds -0.005. An average score of -0.005 against a perfect player is very high. Since the AC agent cannot beat the minimax agent, each episode reward is either 0 or -1, so this threshold means that the AC agent loses fewer than 0.5% of its games against the minimax agent.
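
The link between the average reward and the loss rate can be sketched with a small rolling window. The window size and the game outcomes below are hypothetical; the sketch assumes, as in the text, that every game against the minimax agent ends in either a tie (0) or a loss (-1).

```python
from collections import deque
import numpy as np

# Hypothetical window of 1,000 recent games: 4 losses and 996 ties
window = deque(maxlen=1000)
window.extend([-1] * 4 + [0] * 996)

running_reward = np.mean(np.array(window))
# With no wins possible, the loss rate is the negative of the average reward
loss_rate = -running_reward
print(running_reward, loss_rate, running_reward > -0.005)
```

Here the agent loses 0.4% of its games, so the average reward is -0.004 and the stopping condition is met.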

In [12]:
from collections import deque

running_rewards=deque(maxlen=100000)
gamma = 0.95  
episode_count = 0
batches=0
# Train the model
while True:
    with tf.GradientTape() as tape:
        if batches%2==0:
            action_probs_history,critic_value_history,\
    rewards_history,episode_rewards=create_batch_X(batch_size)
        else:
            action_probs_history,critic_value_history,\
    rewards_history,episode_rewards=create_batch_O(batch_size)            
                     
        # Calculating loss values to update our network        
        tfdif=tf.convert_to_tensor(rewards_history,\
    dtype=tf.float32)-tf.convert_to_tensor(critic_value_history,\
                                           dtype=tf.float32)
        alosses=-tf.multiply(tf.convert_to_tensor(\
          action_probs_history,dtype=tf.float32),tfdif)

        closs=loss_func(tf.convert_to_tensor(rewards_history,\
     dtype=tf.float32),tf.convert_to_tensor(critic_value_history,\
                                            dtype=tf.float32))
        # Backpropagation
        loss_value = tf.reduce_sum(alosses) + closs
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads,\
                                  model.trainable_variables))
    # Log details
    episode_count += batch_size
    batches += 1
    for r in episode_rewards:
        running_rewards.append(r)
    running_reward=np.mean(np.array(running_rewards)) 
    # print out progress
    if episode_count % 1000 == 0:
        template = "running reward: {:.6f} at episode {}"
        print(template.format(running_reward, episode_count))   
    # Stop if the game is solved
    if running_reward > -0.005 and episode_count>100:  
        print("Solved at episode {}!".format(episode_count))
        break
model.save("files/ch16/AC_ttt.h5")

It takes about five hours to train the model. The exact amount of time depends on your computer hardware. 
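
The loss in the training loop above is the sum of an actor term and a critic term. The NumPy sketch below mirrors that arithmetic on two hypothetical moves; the numbers are made up for illustration, and the sketch leaves out the gradient computation that TensorFlow performs.

```python
import numpy as np

# Hypothetical values for two moves in one batch
returns = np.array([0.9, 1.0])     # combined discounted rewards
values = np.array([0.4, 0.5])      # critic's estimates
log_probs = np.log([0.5, 0.25])    # log probabilities of the chosen actions

# Advantage: how much better the outcome was than the critic expected
advantage = returns - values
# Actor term: negative log probability weighted by the advantage, summed
actor_loss = np.sum(-log_probs * advantage)
# Critic term: mean absolute error between rewards and estimates
critic_loss = np.mean(np.abs(returns - values))
loss_value = actor_loss + critic_loss
print(actor_loss, critic_loss, loss_value)
```

A positive advantage makes the actor term smaller when the probability of the chosen action goes up, so minimizing the loss pushes the policy toward moves that turned out better than the critic expected.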

# 4. Play Tic Tac Toe with the Actor-Critic Model

We'll use the trained model to play against the think-three-steps-ahead rule-based AI. We'll first use the stochastic policy: select the moves according to the probabilities produced by the policy network. Later we'll also use the deterministic policy in which the AC agent chooses the move with the highest probability recommended by the policy network in the trained model. 

We'll first let the AC agent move first and play 100 games to see how many games it can win. After that, we'll let the think-three-steps-ahead AI move first and again play 100 games to see the game outcomes.

## 4.1. When the Actor-Critic Agent Moves First 

We'll let the AC agent play 100 games against the rule-based AI. We define an ACplayer() function that selects a move based on the trained AC model.

In [13]:
from utils.ttt_think3 import X_think3,O_think3
import numpy as np

model=keras.models.load_model("files/ch16/AC_ttt.h5")

def ACplayer(env): 
    state = env.state.reshape(-1,9,)
    conv_state = state.reshape(-1,3,3,1)
    if env.turn=="X":
        action_probs, critic_value = model([state,conv_state])
    else:
        action_probs, critic_value = model([-state,-conv_state])        
    aps=[]
    for a in sorted(env.validinputs):
        aps.append(np.squeeze(action_probs)[a-1])
    ps=np.array(aps)/np.array(aps).sum()
    return np.random.choice(sorted(env.validinputs),p=ps)    



The AC player chooses the next action based on the probability distribution recommended by the policy network from the trained model. Here we exclude invalid actions so only legal moves are selected.

We also define a deterministic AC player, who chooses the best move with 100% probability. 

In [14]:
def ACdeterministic(env): 
    state = env.state.reshape(-1,9,)
    conv_state = state.reshape(-1,3,3,1)
    if env.turn=="X":
        action_probs, critic_value = model([state,conv_state])
    else:
        action_probs, critic_value = model([-state,-conv_state])        
    aps=[]
    for a in sorted(env.validinputs):
        aps.append(np.squeeze(action_probs)[a-1])
    ps=np.array(aps)/np.array(aps).sum()
    return sorted(env.validinputs)[np.argmax(ps)] 
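
The step shared by both players, dropping illegal cells and renormalizing what is left, can be sketched without the model. The helper name `legal_move_probs`, the policy output, and the list of empty cells below are all hypothetical.

```python
import numpy as np

def legal_move_probs(action_probs, valid_inputs):
    """Keep only the probabilities of legal cells and renormalize."""
    cells = sorted(valid_inputs)
    # Cells are numbered 1-9, while array indexes run from 0 to 8
    kept = np.array([action_probs[a - 1] for a in cells])
    return cells, kept / kept.sum()

# Hypothetical policy output and list of empty cells
action_probs = np.array([0.1, 0.2, 0.05, 0.05, 0.3, 0.1, 0.1, 0.05, 0.05])
cells, ps = legal_move_probs(action_probs, [9, 2, 5])

rng = np.random.default_rng(0)
stochastic_move = rng.choice(cells, p=ps)    # sample from the renormalized probabilities
deterministic_move = cells[np.argmax(ps)]    # pick the most probable legal cell
print(cells, ps, deterministic_move)
```

With these numbers, the legal cells are 2, 5, and 9, and the deterministic player picks cell 5 because it has the largest probability among them.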

We create a list *results*. In each game, the AC agent moves first and the rule-based AI moves second. If the AC agent wins, we add an outcome 1 to *results*. If the rule-based AI wins, we add an outcome of -1 to *results*. If the game is tied, we add a 0 to *results*. 

In [15]:
results=[]
for i in range(100):
    env=ttt()
    state=env.reset()     
    while True:    
        action=ACplayer(env) 
        state, reward, done, info=env.step(action)
        if done:
            if reward!=0:
                # record 1 if AC agent won
                results.append(1)
            else:
                results.append(0) 
            break
        # the rule-based AI selects its move
        action=O_think3(env) 
        state, reward, done, info = env.step(action)
        if done:
            # record -1 if rule-based AI player won
            results.append(-1)            
            break             

We can count how many times the AC agent has won, lost, and tied:

In [16]:
# Print out the number of games that AC won
wins=results.count(1)
print(f"The AC player has won {wins} games")
# Print out the number of games that AC lost
losses=results.count(-1)
print(f"The AC player has lost {losses} games")              
# Print out the number of tie games
ties=results.count(0)
print(f"There are {ties} tied games") 

The AC player has won 72 games
The AC player has lost 0 games
There are 28 tied games


The AC agent has won 72 times and never lost. The remaining 28 games are tied. The results show that the AC agent is much better than the think-three-steps-ahead rule-based AI when it has the first-mover's advantage. 

## 4.2. When the Actor-Critic Agent Moves Second 

Next, we test whether the AC agent can beat the think-three-steps-ahead rule-based AI when it moves second. This time the AC agent uses the deterministic policy defined in ACdeterministic().

We again create an empty list *results*. We simulate 100 games. In each game, the rule-based AI moves first and the AC agent moves second. If the AC agent wins, we add an outcome 1 to *results*. If the rule-based AI wins, we add an outcome of -1 to *results*. If the game is tied, we add a 0 to *results*. 

In [17]:
results=[]
for i in range(100):
    env = ttt()
    state=env.reset()     
    while True:         
        action = X_think3(env)  
        state, reward, done, info = env.step(action)
        if done:
            if reward!=0:
                # record -1 if rule-based AI player won
                results.append(-1)
            else:
                results.append(0) 
            break
        # select action based on policy network
        action=ACdeterministic(env)
        state, reward, done, info = env.step(action)
        if done:
            # record 1 if AC agent won
            results.append(1)            
            break             

We can count how many times the AC agent has won, lost, and tied:

In [18]:
# Print out the number of games that AC won
wins=results.count(1)
print(f"The AC player has won {wins} games")
# Print out the number of games that AC lost
losses=results.count(-1)
print(f"The AC player has lost {losses} games")              
# Print out the number of tie games
ties=results.count(0)
print(f"There are {ties} tied games") 

The AC player has won 9 games
The AC player has lost 0 games
There are 91 tied games


Most games are tied. The AC agent has won 9 times and never lost. Across the 200 test games in this section, the AC agent did not lose a single game to the rule-based AI.