# Chapter 11: A Policy Network in Connect Four




***
***“We first trained the policy network on 30 million moves from games played by human experts, until it could predict the human move 57% of the time (the previous record before AlphaGo was 44%).”***

-- David Silver and Demis Hassabis, Google DeepMind 
***



In this chapter, you’ll continue with what you learned in Chapters 9 and 10 and apply deep learning to another game: Connect Four. 

Since the Coin game and Tic Tac Toe are solved games, generating expert moves for them is relatively straightforward. Connect Four is also a solved game, but implementing expert moves with a perfect rule-based algorithm (for example, the method in Victor Allis' 1988 master's thesis) is complicated. Further, machine learning is most useful in unsolved games such as Chess or Go, where rule-based algorithms alone are not enough. It's therefore important to learn how to apply machine learning to unsolved games, as the DeepMind team did. For these reasons, we treat Connect Four as an unsolved game and use the AlphaGo approach to build strong agents. 

To generate expert moves in Connect Four, we first design a computer player that chooses moves with the MCTS algorithm 50% of the time and with the MiniMax algorithm the other 50% of the time. Randomizing between the two algorithms avoids identical games in the simulations, so the data cover a wider range of game scenarios. We let this computer player play against itself for 10,000 games. In each game, the winner's moves are treated as expert moves, while the loser's moves are discarded. 

You'll use these expert moves to train a deep neural network. To save time, we train one network to serve as both the fast policy network and the strong policy network. The output layer has seven neurons, representing the seven possible moves the expert can take: column 1 through column 7. Essentially, this is a multi-category classification problem. The neural network includes both dense layers and a convolutional layer. You'll learn to treat the Connect Four game board as a two-dimensional image, extract spatial features from the board (several game pieces in a row horizontally, vertically, or diagonally, for example), and associate these features with expert moves.

After the network is trained, you'll use it to design a game strategy for Connect Four based on the mixed MCTS algorithm (similar to what we did in Chapter 10). Specifically, instead of selecting moves based on UCT scores alone, you'll select moves based on both UCT scores and the probability distribution from the trained policy network. We call the agent that selects moves this way the mixed MCTS agent. Finally, you'll show that the mixed MCTS agent is stronger than the UCT MCTS agent that we developed in Chapter 8.
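
To make the classification framing concrete, here is a minimal NumPy sketch (not the chapter's Keras model). A board stored as a 7×6 array gains a batch axis and a channel axis so that it looks like a one-channel image, and a softmax turns seven raw scores into a probability distribution over columns 1 through 7. The `softmax()` helper and the logits are made up for illustration only.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# An empty board, stored as 7 columns x 6 rows as in this chapter
board = np.zeros((7, 6))
# Add batch and channel axes so the board looks like a one-channel image
image = board.reshape(-1, 7, 6, 1)

# Hypothetical raw scores for the seven columns (illustrative numbers only)
logits = np.array([0.1, 0.4, 0.9, 2.0, 0.9, 0.4, 0.1])
probs = softmax(logits)

# Output index 0 stands for column 1, so add one to recover the column number
best_column = int(np.argmax(probs)) + 1
print(image.shape, probs.round(3), best_column)
```

The off-by-one between output index and column number is the same one handled by `action-1` when the training labels are built later in the chapter.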

# 1.  Deep Learning in Connect Four
  

## 1.1. Steps to Train the Policy Network in Connect Four


## 1.2. Generate Expert Moves in Connect Four


In [1]:
from utils.conn_simple_env import conn
from utils.ch08util import mcts
import numpy as np
from utils.ch06util import MiniMax_conn 

# Initialize the game environment
env=conn()
# Define a player
def player(env):
    # Take the middle column (column 4) if it is still empty
    if len(env.occupied[3])==0:
        action=4
    # If only one valid move is left, take it
    elif len(env.validinputs)==1:
        action=env.validinputs[0]
    else:
        # Use MiniMax 50% of the time
        if np.random.uniform(0,1)<=0.5:
            action=MiniMax_conn(env,depth=3)
        # Use MCTS 50% of the time            
        else: 
            action=mcts(env,num_rollouts=1000)
    return action 

In [2]:
# Define the one_game() function
def one_game():
    history = []
    state=env.reset()   
    while True:   
        action=player(env) 
        history.append([np.array(state).reshape(7,6),
                        action, env.turn])    
        state, reward, done, info = env.step(action)
        if done:
            break             
    return history, reward

# Simulate one game and print out results
history,reward=one_game()
print(history)
print(reward)

[[array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]]), 4, 'red'], [array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]]), 3, 'yellow'], [array([[ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [-1,  0,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0]]), 4, 'red'], [array([[ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [-1,  0,  0,  0,  0,  0],
       [ 1,  1,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0]]), 4, 'yellow'], [array([[ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [-1,  0,  0,  0,  0,  0
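
The arrays above are easier to read once you know the layout. Judging from the output, each of the seven rows of the array is one column of the board, listed from the bottom cell up, which is why red's first move in column 4 shows up as a 1 at position `[3][0]`. The sketch below uses a hand-built array rather than the game environment to show how transposing and reversing the rows produces the familiar picture. The same `state.T[::-1]` expression is used later in this chapter when printing the board.

```python
import numpy as np

# Hand-built position: red (1) played column 4, then yellow (-1) played column 3
state = np.zeros((7, 6), dtype=int)
state[3][0] = 1    # column 4, bottom cell
state[2][0] = -1   # column 3, bottom cell

# Transpose to 6 rows x 7 columns, then reverse the rows
# so that the bottom row of the board is printed last
picture = state.T[::-1]
print(picture)
```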

In [3]:
# simulate 10000 games; takes hours to run
results = []        
for episode in range(10000):
    history,reward=one_game()
    results.append((reward, history))

import pickle
# save the simulation data on your computer
with open('files/games_conn.p', 'wb') as fp:
    pickle.dump(results,fp)
# read the data and print out the first game
with open('files/games_conn.p', 'rb') as fp:
    games = pickle.load(fp) 
print("the first game is", games[0]) 

the first game is (1, [[array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]]), 4, 'red'], [array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]]), 3, 'yellow'], [array([[ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [-1,  0,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0]]), 7, 'red'], [array([[ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [-1,  0,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0,  0]]), 6, 'yellow'], [array([[ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0],
       [

# 2. A Policy Network in Connect Four



## 2.1. Create A Neural Network for Connect Four


In [4]:
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten

fast_model = Sequential()
fast_model.add(Conv2D(filters=128, 
    kernel_size=(4,4),padding="same",activation="relu",
                 input_shape=(7,6,1)))
fast_model.add(Flatten())
fast_model.add(Dense(units=64, activation="relu"))
fast_model.add(Dense(units=64, activation="relu"))
fast_model.add(Dense(7, activation='softmax'))
fast_model.compile(loss='categorical_crossentropy',
                   optimizer='adam', 
                   metrics=['accuracy'])
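
As a sanity check on the architecture, the layer sizes can be worked out by hand. The sketch below uses only the hyperparameters in the cell above and the standard formulas for convolutional and dense layers (weights plus one bias per output unit). The `dense_params()` helper is introduced here for illustration; calling `fast_model.summary()` should report matching counts.

```python
# Hyperparameters copied from the model definition
rows, cols, channels = 7, 6, 1
filters, kernel_h, kernel_w = 128, 4, 4

# Conv2D: one kernel_h x kernel_w x channels kernel plus one bias per filter
conv_params = kernel_h * kernel_w * channels * filters + filters
# padding="same" with the default stride of 1 keeps the 7 x 6 spatial size
flat_units = rows * cols * filters

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_params = [
    conv_params,
    dense_params(flat_units, 64),
    dense_params(64, 64),
    dense_params(64, 7),
]
print(flat_units, layer_params, sum(layer_params))
```

Almost all of the parameters sit in the first dense layer, because the flattened convolutional output is large compared with the 64 units that follow it.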

## 2.2. Train the Neural Network in Connect Four


In [5]:
import pickle
import numpy as np
with open('files/games_conn.p','rb') as fp:
    games=pickle.load(fp)

X=[]
y=[]
for reward, history in games:
    # games in which red won
    if reward>0:
        for state, action, turn in history: 
            # we only use actions taken by red
            if turn=="red":
                X.append(state)
                y.append(action-1)
    # games in which yellow won                
    elif reward<0:
        for state, action, turn in history: 
            # we only use actions taken by yellow            
            if turn=="yellow":
                # multiply the board by -1 so the mover's pieces are 1
                X.append(-state)
                y.append(action-1)        
                
X=np.array(X).reshape((-1,7,6,1))  
y=to_categorical(y,7)  
print(X.shape, y.shape)

(129776, 7, 6, 1) (129776, 7)
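
The sign flip in the cell above is what lets one network serve both players: after multiplying by -1, the pieces of the player about to move are always coded as 1 and the opponent's as -1. The sketch below illustrates this on a hand-built position and builds the one-hot label with plain NumPy. Here `np.eye` stands in for `to_categorical`, and the position is invented for illustration.

```python
import numpy as np

# Invented position, seen from red's side: red = 1, yellow = -1
state = np.zeros((7, 6), dtype=int)
state[3][0] = 1    # red piece at the bottom of column 4
state[2][0] = -1   # yellow piece at the bottom of column 3

# From yellow's point of view, swap the signs so the mover's pieces are 1
yellow_view = -state

# A move in column 5 becomes class index 4, then a one-hot vector of length 7
action = 5
label = np.eye(7)[action - 1]

# Add the batch and channel axes expected by the convolutional layer
x = yellow_view.reshape((-1, 7, 6, 1))
print(x.shape, label)
```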


In [6]:
# Train the policy network for 100 epochs
fast_model.fit(X, y, epochs=100, verbose=1)
fast_model.save('files/policy_conn.h5')

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


# 3. Mixed MCTS in Connect Four


In the local module ch11util, we define a mix_mcts_conn() function as follows:

```python
def mix_mcts_conn(env,model,num_rollouts=100,temperature=1.4):
    # if there is only one valid move left, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # create three dictionaries counts, wins, losses
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    # priors from the policy network
    state = env.state.reshape(-1,7,6,1)
    if env.turn=="red":
        action_probs= model(state)
    else:
        action_probs= model(-state)     
    ps={}
    for a in sorted(env.validinputs):
        ps[a]=np.squeeze(action_probs)[a-1]    
    # roll out games
    for _ in range(num_rollouts):
        # selection
        move=mix_select(env,ps,counts,wins,losses,temperature)
        # expansion
        env_copy, done, reward=expand(env,move)
        # simulation
        reward=simulate(env_copy,done,reward)      
        # backpropagate
        counts,wins,losses=backpropagate(\
            env,move,reward,counts,wins,losses)
    # make the move
    return next_move(ps,counts,wins,losses)
```
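
The `mix_select()` helper called in the selection step is not listed here. As a rough illustration only, and not the actual code in ch11util, one plausible way to blend the two signals is to add to each move's UCT score a prior term that shrinks as the move is visited more often. The function name `mix_select_sketch`, its signature, and the exact formula below are all assumptions.

```python
import math

def mix_select_sketch(valid_moves, ps, counts, wins, losses, temperature=1.4):
    """Hypothetical selection rule: UCT score plus a decaying prior bonus."""
    # Try every move once, starting with the one the policy network likes most
    untried = [m for m in valid_moves if counts[m] == 0]
    if untried:
        return max(untried, key=lambda m: ps[m])
    total = sum(counts[m] for m in valid_moves)
    scores = {}
    for m in valid_moves:
        # Exploitation: average outcome of rollouts that started with this move
        exploit = (wins[m] - losses[m]) / counts[m]
        # Exploration: the usual UCT bonus for rarely tried moves
        explore = temperature * math.sqrt(math.log(total) / counts[m])
        # Prior: the network's opinion, fading as real evidence accumulates
        prior = ps[m] / (1 + counts[m])
        scores[m] = exploit + explore + prior
    return max(scores, key=scores.get)

# Toy statistics for three candidate columns (made-up numbers)
moves = [3, 4, 5]
ps = {3: 0.2, 4: 0.6, 5: 0.2}
counts = {3: 10, 4: 10, 5: 10}
wins = {3: 5, 4: 5, 5: 5}
losses = {3: 3, 4: 3, 5: 3}
print(mix_select_sketch(moves, ps, counts, wins, losses))
```

In the toy data all three columns have identical rollout statistics, so the prior breaks the tie in favor of column 4. With more rollouts the prior term fades and the rollout results dominate.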

# 4. The Effectiveness of the Mixed MCTS


## 4.1. Manually Play against the Mixed MCTS Agent


In [7]:
from utils.conn_simple_env import conn
from utils.ch11util import mix_mcts_conn
from tensorflow.keras.models import load_model
model=load_model("files/policy_conn.h5")

env=conn()
state=env.reset() 
while True:
    action=mix_mcts_conn(env,model,num_rollouts=10000)
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"Current state is \n{state.T[::-1]}")
    if done:
        if reward==1:
            print(f"Player {env.turn} has won!") 
        else:
            print("Game over, it's a tie!")    
        break  
    action=int(input("What's your move?"))
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"Current state is \n{state.T[::-1]}")
    if done:
        print(f"Player {env.turn} has won!")  
        break

Player red has chosen 4
Current state is 
[[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0]]
What's your move?4
Player yellow has chosen 4
Current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0  1  0  0  0]]
Player red has chosen 5
Current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0  1  1  0  0]]
What's your move?4
Player yellow has chosen 4
Current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0  1  1  0  0]]
Player red has chosen 6
Current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0  1  1  1  0]]
What's your move?3
Player yellow has chosen 3
Current state is 

The agent created a double attack and won.

## 4.2. Mixed MCTS vs UCT MCTS in Connect Four


In [8]:
from utils.ch08util import mcts

num_rollouts=100
results=[]
for i in range(100):
    print(i)
    state=env.reset() 
    # In even-numbered games, the UCT MCTS agent moves first
    if i%2==0:
        action=mcts(env,num_rollouts=num_rollouts)
        state, reward, done, info = env.step(action)
    while True:
        action=mix_mcts_conn(env,model,num_rollouts=num_rollouts)  
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if mixed MCTS wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action=mcts(env,num_rollouts=num_rollouts)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if mixed MCTS loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break 

In [9]:
# count how many times mixed MCTS won
wins=results.count(1)
print(f"mixed MCTS won {wins} games")
# count how many times mixed MCTS lost
losses=results.count(-1)
print(f"mixed MCTS lost {losses} games")  
# count how many tie games
ties=results.count(0)
print(f"the game was tied {ties} times")      

mixed MCTS won 66 games
mixed MCTS lost 32 games
the game was tied 2 times
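
To gauge whether a 66 to 32 split could be luck, one rough check (an addition here, not part of the original analysis) is a sign test: drop the two ties and ask how often a fair coin would give at least 66 heads in 98 tosses. The sketch assumes the games are independent and uses only the standard library.

```python
from math import comb

wins, losses = 66, 32        # results reported above; ties are ignored
n = wins + losses            # 98 decisive games

# Probability of at least `wins` successes in n fair coin flips
p_one_sided = sum(comb(n, k) for k in range(wins, n + 1)) / 2**n
win_rate = wins / n
print(f"win rate among decisive games: {win_rate:.3f}")
print(f"one-sided p-value: {p_one_sided:.5f}")
```

A small p-value suggests that the mixed agent's edge is unlikely to be chance alone, although 100 games is still a modest sample.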
