# Chapter 10: Policy Networks in Tic Tac Toe



***
*“All you need is lots and lots of data and lots of information about what the right
answer is, and you’ll be able to train a big neural net to do what you want.”*

-- Geoffrey Hinton
***



What you'll learn in this chapter:
* What is a convolutional neural network
* Building a fast policy network and a strong policy network in Tic Tac Toe
* Generating expert moves to train the two policy networks in Tic Tac Toe
* Implementing a mixed MCTS game strategy in Tic Tac Toe


In the previous chapter, you learned the basics of deep learning and
applied it to the coin game. Specifically, you generated expert moves in the
game and used them to train two policy networks: a fast policy network with just
one hidden layer and a strong policy network with three hidden layers.
In this chapter, you’ll learn to use deep neural networks to train a fast policy network
and a strong policy network in Tic Tac Toe. Unlike in the previous chapter, the
two neural networks in this chapter include a new type of layer, the convolutional layer,
which differs from the fully connected dense layer. While dense layers treat
inputs as one-dimensional vectors, convolutional layers treat images or game boards
as multi-dimensional objects and extract spatial features from them. Convolutional
layers can greatly improve the predictive power of neural networks. This, in turn,
makes the game strategies that use these policy networks more intelligent.

To generate expert moves in Tic Tac Toe, you’ll use the MiniMax algorithm with
alpha-beta pruning that we developed in Chapter 6. You’ll use the board positions as
inputs (Xs) and the expert moves as the targets (ys). Since there are nine potential
next moves for a player in each step, we’ll treat this as a multi-category classification
problem and use supervised learning to train the two policy neural networks by using
the board positions and expert moves. The two trained policy networks will be used
in the AlphaGo algorithm later in this book.
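As a rough illustration of this setup, the sketch below encodes a single training pair with NumPy alone. It assumes the conventions that appear later in the chapter: a flat nine-element board with values in {-1, 0, 1} and moves numbered 1 through 9. The helper name `encode_pair` is hypothetical and is not part of the book's code.

```python
import numpy as np

def encode_pair(board, move):
    """Hypothetical helper: turn one (board, move) pair into (x, y)."""
    # Reshape the flat board into a 3x3 image with a single channel
    x = np.asarray(board, dtype=np.float32).reshape(3, 3, 1)
    # One-hot encode the move; moves are numbered 1-9, so shift to 0-8
    y = np.zeros(9, dtype=np.float32)
    y[move - 1] = 1.0
    return x, y

# Sample pair: the opponent occupies cell 2 and the expert replies in cell 5
x, y = encode_pair([0, -1, 0, 0, 0, 0, 0, 0, 0], 5)
print(x.shape, y)
```

Each target vector has a single 1 in the slot of the expert move, which is the form a nine-way softmax output trained with categorical cross-entropy expects.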

We’ll use the trained fast policy network later in the book to roll out games in MCTS
when implementing the AlphaGo algorithm. In contrast, we’ll use the trained strong
policy network to help select which next move to use when expanding the game
tree in MCTS. To gain insight into how the strong policy network can be used in the
final AlphaGo algorithm, you’ll learn to augment the Monte Carlo Tree Search with
the trained strong policy network in this chapter. Specifically, instead of selecting
moves based on upper confidence bounds for trees (UCT) scores alone, you’ll select
moves based on both UCT scores and the probability distribution from the trained
strong policy network. We call the agent that selects moves this way the mixed MCTS
agent. You’ll show that the mixed MCTS agent is more intelligent than the traditional
MCTS agent that you developed in Chapter 8.

# 1. What Are Convolutional Layers?

In [1]:
import numpy as np

board = np.array([[1,0,0],
                   [1,-1,-1],
                   [1,0,0]]).reshape(-1,3,3,1) 

In [2]:
# Create a vertical filter
vertical_filter = np.array([[0,1,0], 
                   [0,1,0],
                   [0,1,0]]).reshape(3,3,1,1)  

In [3]:
import tensorflow as tf

# Apply the filter on the game board
result=tf.nn.conv2d(board,vertical_filter,strides=1,padding="SAME")
# Print the result
print(result.numpy().reshape(3,3))

[[ 2 -1 -1]
 [ 3 -1 -1]
 [ 2 -1 -1]]
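To see what the convolution above computes, here is a minimal NumPy-only sketch that slides the same filter over the same board with zero padding. The function name `conv2d_same` is an assumption made for illustration, and the sketch only handles a single channel and odd-sized kernels.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Slide a kernel over a 2D image with zero padding ("SAME")."""
    kh, kw = kernel.shape
    # Pad with zeros so the output has the same shape as the input
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Element-wise product of the kernel and the window under it
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

board = np.array([[1, 0, 0],
                  [1, -1, -1],
                  [1, 0, 0]])
vertical_filter = np.array([[0, 1, 0],
                            [0, 1, 0],
                            [0, 1, 0]])
result = conv2d_same(board, vertical_filter)
print(result)
```

The result matches the TensorFlow output above. The value 3 in the middle of the left column can only occur when all three cells of that column hold a 1, so this filter acts as a detector for a completed vertical line.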


# 2.  Deep Learning in Tic Tac Toe


## 2.2. Generate Expert Moves in Tic Tac Toe


In [4]:
from utils.ch06util import MiniMax_ab

def expert(env):
    move = MiniMax_ab(env)
    return move    

def non_expert(env):
    if np.random.rand()<0.5:
        move = MiniMax_ab(env)
    else:
        move = env.sample()
    return move  

In [5]:
from utils.ttt_simple_env import ttt
from copy import deepcopy

# Initiate the game environment
env=ttt()
# Define the one_game() function
def one_game(episode):
    history = []
    state=env.reset()  
    # The non-expert moves first half the time
    if episode%2==0:
        action=non_expert(env)
        state,reward,done,_=env.step(action)
    while True:   
        action=expert(env) 
        if episode%2==0:
            statei=deepcopy(-state)
        else:
            statei=deepcopy(state)            
        actioni=deepcopy(action)
        # record board position and the move
        history.append((statei,actioni))
        state,reward,done,_=env.step(action)
        if done:
            break
        action=non_expert(env)
        state,reward,done,_=env.step(action)     
        if done:
            break
    return history

# Simulate one game and print out results
history=one_game(0)
print(history)        

[(array([ 0,  0,  0,  0,  0,  0,  0, -1,  0]), 9), (array([ 0, -1,  0,  0,  0,  0,  0, -1,  1]), 5), (array([-1, -1,  0,  0,  1,  0,  0, -1,  1]), 3), (array([-1, -1,  1,  0,  1, -1,  0, -1,  1]), 7)]
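The sign flip in `one_game()` means every recorded board is seen from the expert's side: the expert's own pieces read as 1 and the opponent's as -1, whichever mark the expert plays. The sketch below illustrates the idea, assuming (as the code in this chapter suggests) that the environment stores X as 1 and O as -1. The helper name `canonical_view` is hypothetical.

```python
import numpy as np

def canonical_view(state, expert_plays_x):
    """Hypothetical helper: show the board from the expert's side."""
    state = np.asarray(state)
    # If the expert plays O (stored as -1), negate so its pieces read as 1
    return state.copy() if expert_plays_x else -state

# The non-expert (X) has taken cell 8; the expert plays O
raw = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0])
flipped = canonical_view(raw, expert_plays_x=False)
print(flipped)
```

The flipped board has the same form as the first board in the history printed above, where the opponent's opening move shows up as -1.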


In [6]:
# simulate the game 10000 times and record all games
results = []        
for episode in range(10000):
    history=one_game(episode)
    results+=history   

In [7]:
import pickle

# save the simulation data on your computer
with open('files/games_ttt.p', 'wb') as fp:
    pickle.dump(results,fp)
# read the data and print out the first 10 observations       
with open('files/games_ttt.p', 'rb') as fp:
    games = pickle.load(fp)
print(games[:10])

[(array([ 0, -1,  0,  0,  0,  0,  0,  0,  0]), 5), (array([ 0, -1,  0,  0,  1,  0,  0, -1,  0]), 7), (array([-1, -1,  0,  0,  1,  0,  1, -1,  0]), 3), (array([0, 0, 0, 0, 0, 0, 0, 0, 0]), 6), (array([ 0,  0,  0, -1,  0,  1,  0,  0,  0]), 2), (array([ 0,  1,  0, -1,  0,  1,  0,  0, -1]), 8), (array([ 0,  1,  0, -1, -1,  1,  0,  1, -1]), 1), (array([ 1,  1, -1, -1, -1,  1,  0,  1, -1]), 7), (array([ 0,  0,  0,  0,  0,  0, -1,  0,  0]), 5), (array([ 0,  0, -1,  0,  1,  0, -1,  0,  0]), 2)]


# 3. Two Policy Networks in Tic Tac Toe


## 3.1. Create Two Neural Networks for Tic Tac Toe


In [8]:
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten

fast_model = Sequential()
fast_model.add(Conv2D(filters=128, 
    kernel_size=(3,3),padding="same",activation="relu",
                 input_shape=(3,3,1)))
fast_model.add(Flatten())
fast_model.add(Dense(units=64, activation="relu"))
fast_model.add(Dense(units=64, activation="relu"))
fast_model.add(Dense(9, activation='softmax'))
fast_model.compile(loss='categorical_crossentropy',
                   optimizer='adam', 
                   metrics=['accuracy'])

In [9]:
strong_model = Sequential()
strong_model.add(Conv2D(filters=128, 
    kernel_size=(3,3),padding="same",activation="relu",
                 input_shape=(3,3,1)))
strong_model.add(Flatten())
strong_model.add(Dense(units=64, activation="relu"))
strong_model.add(Dense(units=64, activation="relu"))
strong_model.add(Dense(units=64, activation="relu"))
strong_model.add(Dense(9, activation='softmax'))
strong_model.compile(loss='categorical_crossentropy',
                   optimizer='adam', 
                   metrics=['accuracy'])
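The two networks differ only by one extra 64-unit dense layer. A quick way to see how small that difference is in terms of trainable parameters is to count them by hand. The sketch below uses two hypothetical helper functions and plain arithmetic; the totals should agree with what `model.summary()` reports, though that is worth confirming on your own installation.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # One weight per kernel cell per input channel, plus one bias per filter
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(inputs, units):
    # One weight per input-unit pair, plus one bias per unit
    return inputs * units + units

conv = conv2d_params(3, 3, 1, 128)
# "same" padding keeps the 3x3 spatial size, so Flatten yields 3*3*128 values
flat = 3 * 3 * 128
fast_total = (conv + dense_params(flat, 64)
              + dense_params(64, 64) + dense_params(64, 9))
# The strong network adds one more 64-unit hidden layer
strong_total = fast_total + dense_params(64, 64)
print(fast_total, strong_total)
```

Most of the parameters sit in the first dense layer after `Flatten()`, so the extra hidden layer in the strong network adds only a few thousand weights.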

## 3.2. Train the Two Policy Networks


In [10]:
import pickle
import numpy as np
with open('files/games_ttt.p','rb') as fp:
    games=pickle.load(fp)

states=[]
actions=[]
for x in games:
    state=x[0]
    action=to_categorical(x[1]-1,9)
    states.append(state)
    actions.append(action)

X=np.array(states).reshape((-1, 3, 3, 1))
y=np.array(actions).reshape((-1, 9))

In [11]:
# Train the fast policy network for 100 epochs
fast_model.fit(X, y, epochs=100, verbose=1)
fast_model.save('files/fast_ttt.h5')

Epoch 1/100
...

It takes about five minutes to train the model. Once done, we save the trained model in the local folder.

Next, we train the strong policy network for 100 epochs as well.

In [12]:
strong_model.fit(X, y, epochs=100, verbose=1)
strong_model.save('files/strong_ttt.h5')

Epoch 1/100
...

# 4. A Mixed MCTS Algorithm
 

## 4.1. Augment the UCT Formula with a Strong Policy Network


## 4.2. Mixed MCTS in Tic Tac Toe


In the local module *ch10util*, we define a *mix_select()* function as follows:

```python
from math import log, sqrt

gamma=10
def mix_select(env,ps,counts,wins,losses,temperature):
    # a dictionary of mixed scores for all next moves
    scores={}
    # the ones not visited get the priority
    for k in env.validinputs:
        if counts[k]==0:
            return k
    # total number of simulations conducted
    N=sum(counts.values())
    # calculate scores
    for k,v in counts.items():
        # the third term based on policy network
        weighted_pi=gamma*ps[k]/(1+counts[k])
        if v==0:
            scores[k]=weighted_pi
        else:
            # vi for each next move
            vi=(wins.get(k,0)-losses.get(k,0))/v
            # exploration term
            exploration=temperature*sqrt(log(N)/counts[k])
            # mixed score
            scores[k]=vi+exploration+weighted_pi
    # Select the next move with the highest mixed score
    return max(scores,key=scores.get)
```
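Read as a formula, the score that `mix_select()` assigns to a move $i$ that has been visited at least once is the sum of three terms. The symbols below are chosen here for illustration and map directly onto the variables in the code:

$$
\text{score}_i \;=\; \frac{w_i - l_i}{n_i} \;+\; c\,\sqrt{\frac{\ln N}{n_i}} \;+\; \gamma\,\frac{p_i}{1 + n_i}
$$

Here $w_i$, $l_i$, and $n_i$ are the entries of `wins`, `losses`, and `counts` for move $i$, $N$ is the total number of simulations so far, $c$ is `temperature`, $p_i$ is the prior probability `ps[i]` from the strong policy network, and $\gamma$ is the constant `gamma` (set to 10). The first two terms are the usual UCT score; the third is the policy-network term.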

```python
def next_move(ps,counts,wins,losses):
    # See which action is most promising
    scores={}    
    # calculate scores
    for k,v in counts.items():
        # the third term based on policy network
        weighted_pi=gamma*ps[k]/(1+counts[k])       
        # vi for each next move
        vi=(wins.get(k,0)-losses.get(k,0))/v
        # mixed score
        scores[k]=vi+weighted_pi
    # Select the next move with the highest mixed score
    return max(scores,key=scores.get)  
```
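To get a feel for how much the policy-network term matters, the short sketch below evaluates it for one move at several visit counts. It copies the expression `gamma*ps[k]/(1+counts[k])` from the functions above; the function name `prior_bonus` and the prior of 0.5 are made up for the example.

```python
gamma = 10  # same weight as in the module code above

def prior_bonus(p, n):
    """Policy-network term of the mixed score for a move visited n times."""
    return gamma * p / (1 + n)

# A move with prior probability 0.5 at increasing visit counts
for n in [1, 9, 99]:
    print(n, prior_bonus(0.5, n))
```

The win-loss term always lies between -1 and 1, so with `gamma=10` the prior can dominate the score when a move has few visits, and its influence fades as the visit count grows and the rollout statistics take over.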

```python
def mix_mcts(env,model,num_rollouts=100,temperature=1.4):
    # if there is only one valid move left, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # create three dictionaries counts, wins, losses
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    # priors from the policy network
    state = env.state.reshape(-1,3,3,1)
    if env.turn=="X":
        action_probs= model(state)
    else:
        action_probs= model(-state)     
    ps={}
    for a in sorted(env.validinputs):
        ps[a]=np.squeeze(action_probs)[a-1]    
    # roll out games
    for _ in range(num_rollouts):
        # selection
        move=mix_select(env,ps,counts,wins,losses,temperature)
        # expansion
        env_copy, done, reward=expand(env,move)
        # simulation
        reward=simulate(env_copy,done,reward)      
        # backpropagate
        counts,wins,losses=backpropagate(\
            env,move,reward,counts,wins,losses)
    # make the move
    return next_move(ps,counts,wins,losses)
```

# 5. Mixed MCTS versus UCT MCTS
 

In [13]:
from utils.ttt_simple_env import ttt
from utils.ch10util import mix_mcts
from utils.ch08util import mcts
from tensorflow.keras.models import load_model

# load the trained strong policy network
model=load_model("files/strong_ttt.h5")

# Initiate the game environment
env=ttt()
state=env.reset() 
num_rollouts=200
results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=mcts(env,num_rollouts=num_rollouts)
        state, reward, done, info = env.step(action)
    while True:
        action=mix_mcts(env,model,num_rollouts=num_rollouts)  
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if mixed MCTS wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action=mcts(env,num_rollouts=num_rollouts)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if mixed MCTS loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break 

In [14]:
# count how many times mixed MCTS won
wins=results.count(1)
print(f"mixed MCTS won {wins} games")
# count how many times mixed MCTS lost
losses=results.count(-1)
print(f"mixed MCTS lost {losses} games")  
# count how many tie games
ties=results.count(0)
print(f"the game is tied {ties} times")   

mixed MCTS won 30 games
mixed MCTS lost 7 games
the game is tied 63 times
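One rough way to judge whether a 30-to-7 split could be luck is a sign test on the decisive games. This is a sketch, not part of the original analysis: it ignores ties and assumes the games are independent, which is only approximately true here because the two agents alternate who moves first.

```python
from math import comb

wins, losses, ties = 30, 7, 63  # counts reported above
decisive = wins + losses

# Share of decisive games won by mixed MCTS
win_share = wins / decisive

# One-sided sign test: probability of at least `wins` successes
# in `decisive` fair coin flips
p_value = sum(comb(decisive, k)
              for k in range(wins, decisive + 1)) / 2**decisive
print(f"win share: {win_share:.2f}, p-value: {p_value:.1e}")
```

Mixed MCTS wins roughly 81% of the decisive games, and the p-value comes out well below 0.001, so under these assumptions the advantage is unlikely to be due to chance alone.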
