# Chapter 17: AlphaGo in Tic Tac Toe and Connect Four



***
***“We also introduce a new search algorithm that combines Monte Carlo simulation
with value and policy networks. Using this search algorithm, our program AlphaGo
achieved a 99.8% winning rate against other Go programs.”***

-- Google DeepMind Team, Nature (2016) 
***




In Chapter 16, you learned the basic architecture of the AlphaGo algorithm, which combines Monte Carlo tree search with deep neural networks, as outlined in the opening quote from an article published in the journal Nature in 2016. Specifically, you implemented a basic version of the AlphaGo algorithm for the coin game, combining deep reinforcement learning with conventional rule-based AI.

In this chapter, you’ll extend this approach and create an AlphaGo agent that can be applied to more than one game. The agent features two main enhancements. First, it is versatile: the same code handles both Tic Tac Toe and Connect Four. This reduces code redundancy and makes it easier to apply the AlphaGo algorithm to a broader range of games. Second, the agent’s rollouts can use either random moves or moves suggested by the fast policy network. The choice involves a trade-off: the network’s moves are better informed but take longer to compute, because each one requires a forward pass through a neural network. Random moves are faster to generate, which allows more rollouts within a given time frame and can lead to better move selection in actual gameplay.

Monte Carlo Tree Search (MCTS) remains the core of the agent’s decision process, with the four steps of selection, expansion, simulation, and backpropagation outlined in Chapter 8. The difference is that three deep neural networks, introduced in previous chapters, now enhance the tree search. In a real game, the agent runs a large number of simulations from the current game state. In each simulation, it chooses a child node to expand based not on the upper confidence bounds applied to trees (UCT) formula from Chapter 8, but on a weighted average of the node’s rollout value and its prior probability from the trained policy-gradient network. Rollouts can use either random moves or moves from the fast policy network. Instead of always playing to a terminal state, a rollout stops at a specified depth and the resulting game state is evaluated by the trained value network, which allows more simulations within the time limit.
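
The selection rule described above can be written compactly. The notation is introduced here for illustration only: $w$ is the weight placed on the prior, $P(a)$ is the prior probability of move $a$ from the policy-gradient network, $N(a)$ is the number of rollouts that have started with move $a$ so far, and $\bar{V}(a)$ is the average value of those rollouts (taken to be 0 when $N(a)=0$). This mirrors what the *select()* function in Section 1.1 computes:

$$
\text{score}(a) = w\,\frac{P(a)}{1+N(a)} + (1-w)\,\bar{V}(a), \qquad a^{*} = \arg\max_{a}\ \text{score}(a)
$$

Dividing the prior by $1+N(a)$ shrinks its influence as a move accumulates visits, so the rollout values gradually dominate the choice.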

You’ll evaluate the AlphaGo algorithm’s effectiveness in Tic Tac Toe. Against the perfect rule-based AI from Chapter 6, the AlphaGo agent consistently draws, indicating its ability to solve the game.
Given Tic Tac Toe’s simplicity compared to Chess or Go, we’ll also explore an AlphaGo version without the value network, rolling out games to their end. Another variant will use random moves instead of those from the fast policy network. Both versions will be shown to effectively solve Tic Tac Toe, setting the stage for Chapter 20, where we implement AlphaZero in Tic Tac Toe, omitting the value network and relying solely on one policy network.

# 1. An AlphaGo Algorithm for Multiple Games


## 1.1. Functions to Select and Expand
In the local module *ch17util*, we first define a *select()* function to select a child node to expand the game tree, as follows:

```python
def select(priors,env,results,weight):    
    # weighted average of priors and rollout_value
    scores={}
    for k,v in results.items():
        # rollout_value for each next move
        if len(v)==0:
            vi=0
        else:
            vi=sum(v)/len(v)
        # scale the prior by (1+N(L))
        prior=priors[0][k-1]/(1+len(v))
        # calculate weighted average
        scores[k]=weight*prior+(1-weight)*vi
    # select child node based on the weighted average     
    return max(scores,key=scores.get) 
```
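
To see how the weight trades off priors against rollout values, here is a small worked example. It is a sketch that uses a simplified copy of *select()* (the unused `env` argument is dropped) and made-up priors for three moves; the numbers are purely illustrative and do not come from a trained network.

```python
import numpy as np

def select(priors, results, weight):
    # simplified copy of select() for illustration (no env argument)
    scores = {}
    for k, v in results.items():
        vi = sum(v) / len(v) if v else 0
        prior = priors[0][k - 1] / (1 + len(v))
        scores[k] = weight * prior + (1 - weight) * vi
    return max(scores, key=scores.get)

# hypothetical priors for moves 1, 2, 3
priors = np.array([[0.5, 0.3, 0.2]])
results = {1: [], 2: [], 3: []}

# no rollouts yet: the move with the highest prior is selected
first = select(priors, results, weight=0.75)

# suppose the rollout through move 1 ends in a loss (-1)
results[1].append(-1)
# move 1 now scores 0.75*0.25 + 0.25*(-1) = -0.0625,
# below move 2 (0.225) and move 3 (0.15)
second = select(priors, results, weight=0.75)
print(first, second)
```

With these numbers, the first call picks move 1 and the second picks move 2: a single bad rollout both halves the prior term for move 1 and pulls its average value down.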

```python
from copy import deepcopy

# expand the game tree by taking a hypothetical move
def expand(env,move):
    # work on a copy so that the real game state is left untouched
    env_copy=deepcopy(env)
    state,reward,done,info=env_copy.step(move)
    return env_copy, done, reward
```
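
The reason *expand()* copies the environment is that every rollout must start from the same real game state. The sketch below illustrates this with `ToyEnv`, a hypothetical stand-in that only mimics the `step()` and `validinputs` interface used in this chapter; it is not one of the book's game environments.

```python
from copy import deepcopy

class ToyEnv:
    """Hypothetical minimal environment, for illustration only."""
    def __init__(self):
        self.moves_played = []
        self.validinputs = [1, 2, 3]

    def step(self, move):
        self.moves_played.append(move)
        self.validinputs.remove(move)
        done = len(self.validinputs) == 0
        # returns state, reward, done, info like the real environments
        return list(self.moves_played), 0, done, {}

def expand(env, move):
    env_copy = deepcopy(env)
    state, reward, done, info = env_copy.step(move)
    return env_copy, done, reward

env = ToyEnv()
env_copy, done, reward = expand(env, 2)
print(env.moves_played)       # the real game is unchanged
print(env_copy.moves_played)  # only the copy has advanced
```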

## 1.2. Functions to Simulate and Backpropagate


```python
import numpy as np

def best_move_fast_net(env, fast_net):
    # move probabilities from the fast policy network
    if env.turn=="X":
        state = env.state.reshape(-1,3,3,1)
        action_probs=fast_net(state)
    elif env.turn=="O":
        state = env.state.reshape(-1,3,3,1)
        action_probs=fast_net(-state)
    elif env.turn=="red":
        state = env.state.reshape(-1,7,6,1)
        action_probs=fast_net(state)
    elif env.turn=="yellow":
        state = env.state.reshape(-1,7,6,1)
        action_probs=fast_net(-state)             
    ps=[]
    for a in sorted(env.validinputs):
        ps.append(np.squeeze(action_probs)[a-1])
    ps=np.array(ps)
    return np.random.choice(sorted(env.validinputs),
        p=ps/ps.sum())
```
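
The last few lines of *best_move_fast_net()* keep only the probabilities of legal moves and rescale them so that they sum to one before sampling. Here is a minimal sketch of that step in isolation, using a made-up probability vector for a nine-cell board instead of a real network output.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical network output over cells 1..9 (sums to 1)
action_probs = np.array([0.05, 0.30, 0.05, 0.10, 0.20,
                         0.05, 0.05, 0.10, 0.10])
# suppose only cells 2, 5, and 9 are still empty
valid_moves = [2, 5, 9]

# keep the probabilities of legal moves (moves are 1-indexed)
ps = np.array([action_probs[a - 1] for a in valid_moves])
# renormalize so the remaining probabilities sum to 1
ps = ps / ps.sum()
move = rng.choice(valid_moves, p=ps)
print(ps, move)
```

The legal moves keep their relative likelihoods (here 3:2:1), so the sampled move still reflects the network's preferences while never landing on an occupied cell.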

```python
# roll out a game till terminal state or depth reached
def simulate(env_copy,done,reward,depth,value_net,
             fast_net, policy_rollout=True):
    # if the game has already ended
    if done==True:
        return reward
    # select moves from the fast policy network or at random
    for _ in range(depth):
        if policy_rollout:
            move=best_move_fast_net(env_copy, fast_net)
        else:
            move=env_copy.sample()
        state,reward,done,info=env_copy.step(move)
        # if terminal state is reached, returns outcome
        if done==True:
            return reward
    # depth reached but game not over, evaluate
    if env_copy.turn=="X":
        state=state.reshape(-1,3,3,1)
        ps=value_net.predict(state,verbose=0)
        # reward is prob(X win) - prob(O win)
        reward=ps[0][1]-ps[0][2]        
    elif env_copy.turn=="O":
        state=state.reshape(-1,3,3,1)
        ps=value_net.predict(-state,verbose=0)
        # reward is prob(X win) - prob(O win)
        reward=ps[0][2]-ps[0][1]     
    elif env_copy.turn=="red":
        state=state.reshape(-1,7,6,1)
        ps=value_net.predict(state,verbose=0)
        # reward is prob(red win) - prob(yellow win)
        reward=ps[0][1]-ps[0][2]        
    elif env_copy.turn=="yellow":
        state=state.reshape(-1,7,6,1)
        ps=value_net.predict(-state,verbose=0)
        # reward is prob(red win) - prob(yellow win)
        reward=ps[0][2]-ps[0][1]           
    return reward
```
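
When the depth limit is reached, *simulate()* turns the value network's three-way prediction into a single number. The sketch below isolates that conversion. It assumes, based on the comments in the code above, that index 1 of the prediction is the probability that the first player in the network's input wins and index 2 the probability that the second player wins; `value_to_reward` is a hypothetical helper name, not a function in *ch17util*.

```python
import numpy as np

def value_to_reward(ps, flipped):
    # ps has shape (1, 3); indices 1 and 2 are assumed to be the
    # win probabilities of the first and second player as seen
    # by the network
    p_first, p_second = ps[0][1], ps[0][2]
    # if the board was multiplied by -1 before prediction, the
    # roles are swapped, so swap the indices back
    return p_second - p_first if flipped else p_first - p_second

# made-up prediction for illustration
ps = np.array([[0.2, 0.5, 0.3]])
print(value_to_reward(ps, flipped=False))
print(value_to_reward(ps, flipped=True))
```

Either way, the result lies between -1 and 1 and is expressed from the first player's point of view, which is the same scale as the rewards returned at terminal states.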

```python
def backpropagate(env,move,reward,results):
    # update results
    if env.turn=="X" or env.turn=="red":
        results[move].append(reward)
    # if current player is player 2, multiply outcome with -1
    if env.turn=="O" or env.turn=="yellow":
        results[move].append(-reward)                  
    return results
```
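
The sign flip in *backpropagate()* stores every outcome from the point of view of the player who is about to move in the real game. The following sketch shows the effect, assuming (as the comments in *simulate()* suggest) that a reward of 1 means the first player won. A `SimpleNamespace` with a `turn` attribute stands in for the game environment.

```python
from types import SimpleNamespace

def backpropagate(env, move, reward, results):
    # first player: store the reward as is
    if env.turn in ("X", "red"):
        results[move].append(reward)
    # second player: flip the sign
    if env.turn in ("O", "yellow"):
        results[move].append(-reward)
    return results

# the same rollout outcome (first player wins) seen by each side
results_x = backpropagate(SimpleNamespace(turn="X"), 5, 1, {5: []})
results_o = backpropagate(SimpleNamespace(turn="O"), 5, 1, {5: []})
print(results_x, results_o)
```

Because of this convention, *select()* can always maximize the stored values regardless of which side the agent is playing.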

## 1.3. An AlphaGo Agent for Tic Tac Toe and Connect Four


```python
def alphago(env,weight,depth,PG_net,value_net,
        fast_net, policy_rollout=True,num_rollouts=100):
    # if there is only one valid move left, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # get the prior from the PG policy network
    if env.turn=="X" or env.turn=="O":
        state = env.state.reshape(-1,9)
        conv_state = state.reshape(-1,3,3,1)
        if env.turn=="X":
            priors = PG_net([state,conv_state])
        elif env.turn=="O":
            priors = PG_net([-state,-conv_state])  
    if env.turn=="red" or env.turn=="yellow":
        state = env.state.reshape(-1,42)
        conv_state = state.reshape(-1,7,6,1)
        if env.turn=="red":
            priors = PG_net([state,conv_state])
        elif env.turn=="yellow":
            priors = PG_net([-state,-conv_state])          
    # create a dictionary results
    results={}
    for move in env.validinputs:
        results[move]=[]
    # roll out games
    for _ in range(num_rollouts):
        # select
        move=select(priors,env,results,weight)
        # expand
        env_copy, done, reward=expand(env,move)
        # simulate
        reward=simulate(env_copy,done,reward,depth,value_net,
                     fast_net, policy_rollout)
        # backpropagate
        results=backpropagate(env,move,reward,results)
    # select the most visited child node
    visits={k:len(v) for k,v in results.items()}
    return max(visits,key=visits.get)
```
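
Note that *alphago()* returns the most visited move, not the move with the highest average rollout value. The toy example below, with made-up rollout results, shows how the two criteria can disagree.

```python
# hypothetical rollout results after seven simulations
results = {1: [1, 0, 1, 1], 3: [1], 7: [0, -1]}

# number of rollouts per move
visits = {k: len(v) for k, v in results.items()}
# average rollout value per move
means = {k: sum(v) / len(v) for k, v in results.items()}

most_visited = max(visits, key=visits.get)
highest_mean = max(means, key=means.get)
print(most_visited, highest_mean)
```

Move 3 has the highest average, but it rests on a single rollout. Move 1 was selected repeatedly because it kept scoring well, so its visit count is the more reliable signal.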

# 2. Test the AlphaGo Agent in Tic Tac Toe

## 2.1. The Opponent in Tic Tac Toe Games

```python
from utils.ch06util import MiniMax_ab

def rule_based_AI(env):
    move = MiniMax_ab(env)
    return move
```

```python
from copy import deepcopy
import numpy as np
from tensorflow import keras

# Load the trained fast policy network from Chapter 10
fast_net=keras.models.load_model("files/fast_ttt.h5")
# Load the policy gradient network from Chapter 14
PG_net=keras.models.load_model("files/pg_ttt.h5")
# Load the trained value network from Chapter 14
value_net=keras.models.load_model("files/value_ttt.h5")
```



## 2.2. AlphaGo vs Rule-Based AI in Tic Tac Toe 


```python
from utils.ttt_simple_env import ttt
from utils.ch17util import alphago

weight=0.75
depth=5
# initiate game environment
env=ttt()
# test ten games
for i in range(10):
    state=env.reset()
    while True:
        # AlphaGo moves first
        action=alphago(env,weight,depth,PG_net,value_net,
                fast_net, policy_rollout=True,num_rollouts=100)
        state,reward,done,_=env.step(action)
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("AlphaGo wins!")
            break
        # move recommended by rule-based AI
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
        if done:
            print("Rule-based AI wins!")
            break
```

```
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
```


```python
# test ten games
for i in range(10):
    state=env.reset()
    while True:
        # Rule-based AI moves first
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("Rule-based AI wins!")
            break
        # AlphaGo moves second
        action=alphago(env,weight,depth,PG_net,value_net,
                fast_net, policy_rollout=True,num_rollouts=100)
        state,reward,done,_=env.step(action)
        if done:
            print("AlphaGo wins!")
            break
```

```
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
The game is tied!
```


# 3. Redundancy in AlphaGo


```python
weight=0.8
depth=10
# create a list to record game outcomes
results=[]
# test 20 games
for i in range(20):
    state=env.reset()
    if i%2==0:
        # Rule-based AI moves first in even-numbered games
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
    while True:
        # AlphaGo moves
        action=alphago(env,weight,depth,PG_net,value_net,
                fast_net, policy_rollout=True,num_rollouts=100)
        state,reward,done,_=env.step(action)
        if done:
            # 1 if AlphaGo wins, 0 if the game is tied
            results.append(abs(reward))
            break
        # Rule-based AI moves
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
        if done:
            # -1 if AlphaGo loses, 0 if the game is tied
            results.append(-abs(reward))
            break
```

```python
# count how many times AlphaGo won
wins=results.count(1)
print(f"AlphaGo won {wins} games")
# count how many times AlphaGo lost
losses=results.count(-1)
print(f"AlphaGo lost {losses} games")
# count tie games
ties=results.count(0)
print(f"the game was tied {ties} times")
```

```
AlphaGo won 0 games
AlphaGo lost 0 games
the game was tied 20 times
```


```python
weight=0.9
depth=10
# create a list to record game outcomes
results=[]
# test 20 games
for i in range(20):
    state=env.reset()
    if i%2==0:
        # Rule-based AI moves first in even-numbered games
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
    while True:
        # AlphaGo moves, setting policy_rollout=False
        action=alphago(env,weight,depth,PG_net,value_net,
                fast_net, policy_rollout=False,num_rollouts=100)
        state,reward,done,_=env.step(action)
        if done:
            # 1 if AlphaGo wins, 0 if the game is tied
            results.append(abs(reward))
            break
        # Rule-based AI moves
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
        if done:
            # -1 if AlphaGo loses, 0 if the game is tied
            results.append(-abs(reward))
            break
```

```python
# count how many times AlphaGo won
wins=results.count(1)
print(f"AlphaGo won {wins} games")
# count how many times AlphaGo lost
losses=results.count(-1)
print(f"AlphaGo lost {losses} games")
# count tie games
ties=results.count(0)
print(f"the game was tied {ties} times")
```

```
AlphaGo won 0 games
AlphaGo lost 0 games
the game was tied 20 times
```
